Reading view

There are new articles available, click to refresh the page.

Making empirical decisions about web access (here in 2026)

By: cks

16 March 2026 at 02:12

Recently, Denis Warburton wrote in a comment on my entry on how HTTP results today depend on what HTTP User-Agent you use:

Making decisions based on user-provided information is unwise in 2026. The originating ip address is the only source of "truth" ... and even then, that information needs to be further examined before discerning whether or not it is a valid piece of communication.

It's absolutely true that everything except the source IP address is under the control of an attacker (and it always has been), and in one sense you can't trust it. But this doesn't mean you can't use information that's under the attacker's control in making decisions about whether to allow access to something; instead, it means that you have to be thoughtful about how you use the information and what for.

In practice, web agents emit a lot of data in their HTTP headers and requests. Some of these signals are complicated, such as browser version numbers, and some of them require work to use, but this doesn't mean that there's no signal at all that can be derived from all of the data that a web agent emits. For example, consider a web agent that uses the HTTP User-Agent of:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This web agent is telling you that it's claiming to be Googlebot. Under the right circumstances this can be a valuable signal of malfeasance and worth denying access.

Similarly, a web agent that emits user agent hints while its HTTP User-Agent is claiming to be an authentic version of Firefox 147 is giving you the signal that it's not an unaltered, standard version of Firefox, because standard versions of Firefox 147 don't do that. It's most likely something built on Chromium, but in any case you might decide that this signal means it is suspicious enough to be denied access. Neither the User-Agent nor the Sec-CH-UA headers create true facts to definitively identify the browser and both could be faked by the attacker, but the inconsistency is real.

What an attacker tells you (deliberately or accidentally) is a signal, and it's up to you to interpret and use that signal (which I think you should these days). This is an empirical thing, something that depends on the surrounding environment (for example, you have to interpret the attacker's signal in terms of its difference from the signals of legitimate visitors), what you're doing, and what you care about, but then security is always ultimately people, not math, even though tech loves to avoid this sort of empiricism (which is a bad thing).

As a pragmatic thing, it's usually easier to use attacker signals if you allow things by default rather than deny them by default. If you allow by default, your primary concern is false positives (legitimate visitors who are emitting signals you find too suspicious), rather than false negatives, because an attacker that wants to work hard enough can always obtain access. Conveniently, public web sites (such as Wandering Thoughts) are exactly such an allow by default environment, which is why these days I use a lot of signals here when deciding what to accept or block (including IP addresses and networks).

(If you need a deny by default environment with real security, you need to use something that attackers can't fake. IP addresses can be one option in the right circumstances, but they aren't the only one.)

(2 comments.)

On today's web, HTTP results depend on the HTTP User-Agent you use

Chris's Wiki :: blog

By: cks

14 March 2026 at 03:27

Back in the old days, search engines mostly crawled your sites with their regular, clearly identifying HTTP User-Agent headers, but once in a while they would switch up to fetching with a browser's User-Agent. What they were trying to detect was if you served one set of content to "Googlebot" but another set of content to "Firefox", and if you did they tended to penalize you; you were supposed to serve the same content to both, not SEO-bait to Googlebot and wall to wall ads to browsers. Googlebot identified itself as a standard courtesy, not so you could handle it differently.

Obviously those days are long over. It's now routine and fully accepted to serve different things to Googlebot and to regular browsers. Generally websites offer Googlebot more access and plain text, and browsers less access (even paywalls) and JavaScript encrusted content (leading to people setting their User-Agent to Googlebot to bypass paywalls). Since people give Googlebot special access, people impersonate it and other well accepted crawlers and other people (like me) block that impersonation.

This is part of an increasingly common general pattern, which is that different HTTP User-Agents get different results for the same URL. Especially, some HTTP User-Agents will get errors, HTTP redirections, or challenge pages, and other User-Agents won't; instead they'll get the real content. What this means in concrete terms is it's increasingly bad to take the results from one HTTP User-Agent and assume they apply for another. This isn't just me and Wandering Thoughts; for example, if a site has a standard configuration of Anubis, having a User-Agent that includes 'Mozilla' will cause you to get a challenge page instead of the actual page (cf).

(One of the amusing effects of this is what it does to 'link previews', which require the website displaying the preview to fetch a copy of the URL from the original site. On the Fediverse, fairly often the link preview I see is just some sort of a challenge page.)

In practice, you're probably reasonably safe if you're doing close variations of what's fundamentally the same distinctive User-Agent. But you're living dangerously if you try this with browser-like User-Agent values, either two different ones or a browser-like User-Agent and a distinctive non-browser one, because those are the ones that are most frequently forged and abused by covert web crawlers and other malware. Everyone who wants to look normal is imitating a browser, which means looking like a browser is a bad idea today.

Unfortunately, however bad an idea it is, people seem to keep trying fetches with multiple User-Agent header values and then taking a result from one User-Agent and using it in the context of another. Especially, feed reader companies seem to do it, first Feedly and now Inoreader.

(2 comments.)

If there are URLs in your HTTP User-Agent, they should exist and work

Chris's Wiki :: blog

By: cks

9 March 2026 at 02:18

One of the things people put in their HTTP User-Agent header for non-browser software is a URL for their software, project, or whatever (I'm all for this). This is a a good thing, because it allows people operating web servers to check out who and what you are and decide for themselves if they're going to allow it. Increasingly (and partly for social reasons), I block many 'generic' User-Agent values that come to my attention, for example through their volume.

(I don't block all of them, but if your User-Agent shows up and I can't figure out what it is and whether or not it's legitimate and used by real people, that's probably a block.)

However, there's an important and obvious thing about any URLs in your HTTP User-Agent, which is that they should actually work. The domain or host should exist, the URL should exist in the web server, and the URL's contents should actually explain the software, project, or organization involved. Plus, if you use a HTTPS website, the TLS certificate should be valid.

(A related thing is a generic URL that doesn't give me anything to go on. For example, your URL on a code forge, and either it's not obvious which one of your repositories is doing things or you don't have any public repositories.)

For me, a non-working URL is much more suspicious than a missing URL. HTTP User-Agents without URLs are reasonably common (especially in feed readers), so I don't find them immediately suspicious. Non-working URLs in mysterious User-Agents certainly look like you're attempting to distract me with the appearance of a proper web agent but without the reality of it. If a User-Agent with such a non-working URL comes to my attention, I'm very likely to block it in some way (unless it's very clear that it's a legitimate program used by real people, and it merely has bad habits with its User-Agent).

You would think that people wouldn't make this sort of mistake, but I regret to say that I've seen it repeatedly, in all of the variations. One interesting version I've seen is User-Agent strings with the various 'example.<TLD>' domains in their URLs. I suspect that this comes from software that has some sort of 'operator URL' setting and provides a default value if you don't set one explicitly. I've also seen .lan and .local URLs in User-Agents, which takes somewhat more creativity.

As usual, my view is that software shouldn't provide this sort of default value; instead, it should refuse to work until you configure your own value. However, this makes it slightly more annoying to use, so it will be less popular than more accommodating software. Of course, we can change that calculation by blocking everything that mentions 'example.com', 'example.org', 'example.net' and so on in its User-Agent.

(One comment.)

The importance of limiting syndication feed requests in some way

Chris's Wiki :: blog

By: cks

22 February 2026 at 01:27

People sometimes wonder why I care so much about HTTP conditional GETs and rate limiting for syndication feed fetchers. There are multiple reasons, including social reasons to establish norms, but one obvious one is transfer volumes. To illustrate that, I'll look at the statistics for yesterday for feed fetches of the main syndication feed for Wandering Thoughts.

Yesterday there were 7492 feed requests that got HTTP 200 responses, 9419 feed requests that got HTTP 304 Not Modified responses, and 11941 requests that received HTTP 429 responses. The HTTP 200 responses amounted to about 1.26 GBytes, with the average response size being 176 KBytes. This average response size is actually a composite; typical compressed syndication feed responses are on the order of 160 KBytes, while uncompressed ones are on the order of 540 KBytes (but there look to have been only 313 of them, which is fortunate; even still they're 12% of the transfer volume).

If feed readers didn't do any conditional GETs and I didn't have any rate limiting (and all of the requests that got HTTP 429s would still have been made), the additional feed requests would have amounted to about another 3.5 GBytes of responses sent out to people. Obviously feed readers did do conditional GETS, and 66% of their non rate limited requests were successful conditional GETs. A HTTP 200 response ratio of 44% is probably too pessimistic once we include rate limited requests, so as an extreme approximation we'll guess that 33% of the rate limited requests would have received HTTP 200 responses with a changed feed; that would amount to another 677 MBytes of response traffic (which is less than I expected). If we use the 44% HTTP 200 ratio, it's still only 903 MBytes more.

(This 44% rate may sound high but my syndication feed changes any time someone leaves a comment on a recent entry, because the syndication feed of entries includes a comment count for every entry.)

Another statistic is that 41% of syndication feed requests yesterday got HTTP 429 responses. The most prolific single IP address received 950 HTTP 429s, which maps to an average request interval of less than two minutes between requests. Another prolific source made 779 requests, which again amounts to an interval of just less than two minutes. There are over 20 single IPs that received more than 96 HTTP 429 responses (which corresponds to an average interval of 15 minutes). There is a lot of syndication feed fetching software out there that is fetching quite frequently.

(Trying to figure out how many HTTP 429 sources did conditional requests is too complex with my current logs, since I don't directly record that information.)

You can avoid the server performance impact of lots of feed fetching by arranging to serve syndication feeds from static files instead of a dynamic system (and then you can limit how frequently you update those files, effectively forcing a maximum number of HTTP 200 fetches per time interval on anything that does conditional GETs). You can't avoid the bandwidth effects, and serving from static files generally leaves you with only modest tools for rate limiting.

PS: The syndication feeds for Wandering Thoughts are so big because I've opted to default to 100 entries in them, but I maintain you should be able to do this sort of thing without having your bandwidth explode.

(7 comments.)

Sometimes giving syndication feed readers good errors is a mistake

Chris's Wiki :: blog

By: cks

16 February 2026 at 03:56

Yesterday I wrote about the problem of giving feed readers error messages that people will actually see, because you can't just give them HTML text; in practice you have to wrap your HTML text up in a stub, single-entry syndication feed (and then serve it with a HTTP 200 success code). In many situations you're going to want to do this by replying to the initial feed request with a HTTP 302 temporary redirection that winds up on your stub syndication feed (instead of, say, a general HTML page explaining things, such as "this resource is out of service but you might want to look at ...").

Yesterday I put this into effect for certain sorts of problems, including claimed HTTP User-Agents that are for old browser. Then several people reported that this had caused Feedly to start presenting my feed as the special 'your feed reader is (claiming to be) a too-old browser' single entry feed. The apparent direct cause of this is that Feedly made some syndication feed requests with HTTP User-Agent headers of old versions of Chrome and Firefox, which wound up getting a series of HTTP 302 temporary redirections to my new 'your feed reader is a too-old browser' stub feed. Feedly then decided to switch its main feed fetcher over to directly using this new URL for various feeds, despite the HTTP redirections being temporary (and not served for its main feed fetcher, which uses "Feedly/1.0" for its User-Agent).

Feedly has been making these fake browser User-Agent syndication feed fetch attempts for some time, and for some time they've been getting HTTP 302 redirections. However, up until late yesterday, what Feedly wound up on was a regular HTML web page. I have to assume that since this wasn't a valid syndication feed, Feedly ignored it. Only when I did the right thing to give syndication feed readers a good, useful error result did Feedly receive a valid syndication feed and go over the cliff.

Providing a stub syndication feed to communicate errors and problems to syndication feed fetchers is clearly the technically correct answer. However, I'm now somewhat less convinced that it's the most useful answer in practice. In practice, plenty of syndication feed fetchers keep fetching and re-fetching these stub feeds from me, suggesting that people either aren't seeing them or aren't doing anything about it. And now I've seen a feed reader malfunction spectacularly and in a harmful way because I gave it a valid syndication feed result at the end of a temporary HTTP redirection.

(I will probably stick to the current situation, partly because I no longer feel like accepting bad behavior from web agents.)

PS: If you're a feed fetching system, please give your feeds IDs that you put in the User-Agent, so that when they all wind up shifted to the same URL through some misfortune, the website involved can sort them out and redirect them back to the proper URLs.

(3 comments.)

The problem of delivering errors to syndication feed readers

Chris's Wiki :: blog

By: cks

15 February 2026 at 04:30

Suppose, not hypothetically, that there are some feed readers (or at least things fetching your syndication feeds) that are misbehaving or blocked for one reason or another. You could just serve these feed readers HTTP 403 errors and stop there, but you'd like to be more friendly. For regular web browsers, you can either serve a custom HTTP error page that explains the situation or answer with a HTTP 302 temporary redirection to a regular HTML page with the explanation. Often the HTTP 302 redirection will be easier because you can use various regular means to create the HTML pages (and even host them elsewhere if you want). Unfortunately, this probably leaves syndication feed readers out in the cold.

(This can also come up if, for example, you decommission a syndication feed but want to let people know more about the situation than a simple HTTP 404 would give them.)

As far as I know, most syndication feed readers expect that the reply to their HTTP feed fetching request is in some syndication feed format (Atom, RSS, etc), which they will parse, process, and display to the person involved. If they get a reply in a different format, such as text/html, this is an error and it won't be shown to the person. Possible the HTML <title> element will make it through, or the HTTP status code response for an error, or maybe both. But your carefully written HTML error page is unlikely to be seen.

(Since syndication feed readers need to be able to display HTML in general, they could do something to show people at least the basic HTML text they got back. But I don't think this is very common.)

As a practical thing, if you want people using blocked syndication feed readers to have a chance to see your explanation, you need to reply with a syndication feed with an entry that is your (HTML) message to them (either directly or through HTTP 302 redirections). Creating this stub feed and properly serving it to appropriate visitors may be anywhere from annoying to challenging. Also, you can't reply with HTTP error statuses (and the feed) even though that's arguably the right thing to do. If you want syndication feed readers to process your stub feed, you need to provide it as part of a HTTP 200 reply.

(Speaking from personal experience I can say that hand-writing stub Atom syndication feeds is a pain, and it will drive you to put very little HTML in the result. Which is okay, you can make it mostly a link to your regular HTML page about whatever issue it is.)

If you're writing a syndication feed reader, I urge you to optionally display the HTML of any HTTP error response or regular HTML page that you receive. If I was writing some sort of blog system today, I would make it possible to automatically generate a syndication feed version of any special error page the software could serve to people (probably through some magic HTTP redirection). That way people can write each explanation only once and have it work in both contexts.

(5 comments.)

A surprising path to accessing localhost URLs and HTTP services

Chris's Wiki :: blog

By: cks

6 February 2026 at 03:43

One of the classic challenges in web security is DNS rebinding. The simple version is that you put some web service on localhost in order to keep outside people from accessing it, and then some joker out in the world makes 'evil.example.org' resolve to 127.0.0.1 and arranges to get you to make requests to it. Sometimes this is through JavaScript in a browser, and sometimes this is by getting you to fetch things from URLs they supply (because you're running a service that fetches and processes things from external URLs, for example).

One way people defend against this is by screening out 127.0.0.0/8, IPv6's ::1, and other dangerous areas of IP address space from DNS results (either in the DNS resolver or in your own code). And you can also block URLs with these as explicit IP addresses, or 'localhost' or the like. Sometimes you might add extra security restrictions to a process or an environment through means like Linux eBPF to screen out which IP addresses you're allowed to connect to (cf, and I don't know whether systemd's restrictions would block this).

As I discovered the other day, if you connect to INADDR_ANY, you connect to localhost (which any number of people already knew). Then in a comment Kevin Lyda reminded me that INADDR_ANY is also known as 0.0.0.0, and '0' is often accepted as a name that will turn into it, resulting in 'ssh 0' working and also (in some browsers) 'http://0:<port>/'. The IPv6 version of INADDR_ANY is also an all-zero address, and '::0' and '::' are both accepted as names for it, and then of course it's easy to create DNS records that resolve to either the IPv4 or IPv6 versions. As I said on the Fediverse:

Surprise: blocking DNS rebinding to localhost requires screening out more than 127/8 and ::1 answers. This is my face.

It turns out that this came up in mid 2024 in the browser context, as '0.0.0.0 Day' (cf). Modern versions of Chrome and Safari apparently explicitly block requests to 0.0.0.0 (and presumably also the IPv6 version), while Firefox will still accept it. And of course your URL-fetching libraries will almost certainly also accept it, especially through DNS lookups of ordinary looking but attacker controlled hostnames.

In my view, it's not particularly anyone's fault that this slipped through the cracks, both in browsers and in tools that handle fetching content from potentially hostile URLs. The reality of life is that how IP behaves in practice is complicated and some of it is historical practice that's been carried forward and isn't necessarily obvious or well known (and certainly isn't standardized). Then URLs build on top of this somewhat rickety foundation and surprises happen.

(This is related to the issue of browsers being willing to talk to 'local' IPs, which Chrome once attempted to start blocking (and I believe that shipped, but I don't use Chrome any more so I don't know what the current state is).)

(2 comments.)

Single sign on systems versus X.509 certificates for the web

Chris's Wiki :: blog

By: cks

19 January 2026 at 03:59

Modern single sign on specifications such as OIDC and SAML and systems built on top of them are fairly complex things with a lot of moving parts. It's possible to have a somewhat simple surface appearance for using them in web servers, but the actual behind the scenes implementation is typically complicated, and of course you need an identity provider server and its supporting environment as well (which can get complicated). One reaction to this is to suggest using X.509 certificates to authenticate people (as a recent comment on this entry did).

There are a variety of technical considerations here, like to what extent browsers (and other software) might support personal X.509 certificates and make them easy to use, but to my mind there's also an overriding broad consideration that makes the two significantly different. Namely, people can remember passwords but they have to store X.509 certificates. OIDC and SAML may pass around tokens and programs dealing with them may store tokens, but the root of everything is in passwords, and you can recover all the tokens from there. This is not true with X.509 certificates; the certificate is the thing.

(There are also challenges around issuing, managing, checking, and revoking personal X.509 certificates, but let's ignore them.)

To make using X.509 certificate practical for authenticating people, people have to be able to use them on multiple devices and move them between browsers. Many people have multiple devices and people do change what browsers they use (for all that browser and platform vendors like them not to, or at least the ones that are currently popular are often all for that). Today, there is basically nothing that helps people deal with this, and as a result X.509 certificates are at best awkward for people to use (and remember, security is people).

(In common use, it's easy to move passwords between browsers and devices because they're in your head (excluding password managers, which are still not used by a lot of people).)

Of course you could develop standards and software for moving and managing X.509 certificates. In many ways, passkeys show what's possible here, and also show many of the hazards of using things for authentication that can't be memorized (or copied) by people in order to transport them between environments. However, no such standards and software exist today, and no one has every shown much interest in developing them, even back in the days when personal X.509 certificates were close to your only game in town.

(You could also develop much better browser UIs for dealing with personal X.509 certificates, something that was extremely under-developed back in the days when they were sometimes in use. Even importing such a certificate into your browser could be awkward, never mind using it.)

In the past, people have authenticated web applications through the use of personal X.509 certificates (as a more secure form of passwords). As far as I know, pretty much everyone has given up on that and moved to better options, first passwords (sometimes plus some form of additional confirmation) and then these days trying to get people to use passkeys. One reason they gave up was that actually using X.509 certificates in practice was awkward and something that people found quite annoying.

(I had to use a personal X.509 certificate for a while in order to get free TLS certificates for our servers. It wasn't a particularly great experience and I'm not in the least bit surprised that everyone ditched it for single sign on systems.)

PS: It's no good saying that X.509 certificates would be great if all of the required technology was magically developed, because that's not going to just happen. If you want personal X.509 certificates to be a thing, you have a great deal of work ahead of you and there is no guarantee you'll be successful. No one else is going to do that work for you.

PPS: You can imagine a system where people use their passwords and other multi-factor authentication to issue themselves new personal X.509 certificates signed by your local Certificate Authority, so they can recover from losing the X.509 certificate blob (or get a new certificate for a new device). Congratulations, you have just re-invented a manual version of OIDC tokens (also, it's worse in various ways).

(6 comments.)

What 24 hours of traffic looks like to our main web server in January 2026

Chris's Wiki :: blog

By: cks

8 January 2026 at 03:54

One of the services we operate for the department is a traditional Apache-based shared web server, with things like people's home pages (eg), pages for various groups, and so on (we call this our departmental web server). This web server has been there for a very long time and its URLs have spread everywhere, and in the process it's become quite popular for some things. These days there are a lot of things crawling everything in sight, and our server has no general defenses against them (we don't even have much of a robots.txt).

(Technically our perimeter firewall has basic HTTP and HTTPS brute-force connection rate limits, but people typically have to really work to trigger them and they mostly don't. Although now that I look at yesterday, more IPs wound up listed than I expected, although listings normally last at most five minutes.)

The first, very noticeable thing that we have is people who do very slow downloads from us. Our server rolls over the logs at midnight, but Apache only writes a log record when a HTTP request completes, possibly to the old log file. Yesterday (Tuesday), the last log record was written at 05:24, for a request that started at 22:44. Over the 24 hours that requests were initiated in, we saw 1.2 million requests.

The two most active User-Agents were (in somewhat rounded numbers):

426000 "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.0 Mobile/15E148 Safari/604.1"
424000 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0 Safari/537.36"

The most active thing that was willing to admit it wasn't a human with a browser was "ChatGPT-User", with just under 20,000 requests. After that came "GoogleOther" and "Amazonbot", at about 12,000 requests each, then "Googlebot" with 10,000 and bingbot with about 6,000. Of course, some of those could be people impersonating the real Googlebot and bingbot.

To my surprise, the most popular HTTP result code by far was HTTP 301 Moved Permanently, at 844,000 responses (HTTP 200s were 347,000, everything else was small by comparison). And most of the requests by the those two most active User-Agents got HTTP 301 responses (roughly 418,000 each). I don't know what's going on there, but someone seems to have latched on to a lot of URLs that require redirects (which include things like directory URLs without the '/' on the end). On the positive side, most of those requests will have been pretty cheap for Apache to handle.

A single DigitalOcean IP claiming to be running Chrome 61 on 'Windows NT 10.0' made 11,000 requests, most of which got HTTP 404 errors because it was requesting URLs like '/wp-login.php'. There's no point complaining to hosting providers about this sort of thing, it's just background noise. No other single IP stood out to that degree (well, our monitoring system made over 10,000 requests, but that's expected). Google mostly crawled from a few IPs, with large counts, but other crawlers were more spread out.

To find out more traffic information, we need to go to looking at Autonomous System Numbers (ASNs), using asncounter. This reports:

 count   percent ASN     AS
 463536  36.55   210906  BITE-US, LT
 152237  12.0    212286  LONCONNECT, GB
 65064   5.13    3257    GTT-BACKBONE GTT, US
 53927   4.25    7385    ABUL-14-7385, US
 45255   3.57    8075    MICROSOFT-CORP-MSN-AS-BLOCK, US
 32557   2.57    7029    WINDSTREAM, US
 32101   2.53    55286   SERVER-MANIA, CA
 30037   2.37    15169   GOOGLE, US
 24412   1.92    239     UTORONTO-AS, CA
 21745   1.71    7015    COMCAST-7015, US
 16311   1.29    64200   VIVIDHOSTING, US
 [...]

And then for prefixes:

 count   percent prefix  ASN     AS
 64312   5.07    138.226.96.0/20 3257    GTT-BACKBONE GTT, US
 43459   3.43    85.254.128.0/22 210906  BITE-US, LT
 43161   3.4     185.47.92.0/22  210906  BITE-US, LT
 43111   3.4     45.131.216.0/22 212286  LONCONNECT, GB
 43040   3.39    45.145.136.0/22 212286  LONCONNECT, GB
 42998   3.39    45.138.248.0/22 212286  LONCONNECT, GB
 42870   3.38    185.211.96.0/22 210906  BITE-US, LT
 32365   2.55    85.254.112.0/22 210906  BITE-US, LT
 26937   2.12    66.249.64.0/20  15169   GOOGLE, US
 23785   1.88    128.100.0.0/16  239     UTORONTO-AS, CA
 23088   1.82    45.154.148.0/22 212286  LONCONNECT, GB
 21767   1.72    85.254.42.0/23  210906  BITE-US, LT
 [and then five more BITE-US prefixes at the same
  volume level, then many more prefixes]

Given that we have two extremely prolific User-Agents, let's look at where those requests came from in specific, and you will probably not be surprised at the results:

 count   percent ASN     AS
 462925  54.37   210906  BITE-US, LT
 152155  17.87   212286  LONCONNECT, GB
 64321   7.55    3257    GTT-BACKBONE GTT, US
 53649   6.3     7385    ABUL-14-7385, US
 32287   3.79    7029    WINDSTREAM, US
 31955   3.75    55286   SERVER-MANIA, CA
 21710   2.55    7015    COMCAST-7015, US
 16304   1.92    64200   VIVIDHOSTING, US
 [...]

If you have the ability to block traffic by ASN and you don't need to accept requests from clouds and your traffic is anything like this, you can probably drop a lot of it quite easily.

I can ask a different question: if we exclude those two popular User-Agents and look only at successful requests (HTTP 200 responses), where do they come from?

 count   percent ASN     AS
 38821   11.61   8075    MICROSOFT-CORP-MSN-AS-BLOCK, US
 25510   7.63    15169   GOOGLE, US
 16968   5.07    239     UTORONTO-AS, CA
 12816   3.83    14618   AMAZON-AES, US
 11529   3.45    396982  GOOGLE-CLOUD-PLATFORM, US
 [...]

(There are about 334,000 of these in total.)

The 'UTORONTO-AS' listing includes our own monitoring, with its 10,000 odd requests. Much of Google's requests come from their 66.249.64.0/20 prefix, which is mostly or entirely used by various Google crawlers.

Around 138,000 requests were for a set of commonly used ML training data, and they probably account for most of the bandwidth used by this web server (which typically averages 40 Mbytes/sec of outgoing bandwidth all of the time on weekdays).

(I've previously done HTTP/2 stats for this server as of mid 2025.)

(One comment.)

Some notes on using the Sec-CH-UA HTTP headers that Chrome supports

Chris's Wiki :: blog

By: cks

25 December 2025 at 02:39

A while back, Chrome proposed and implemented what are called user agent hints, which are a collection of Sec-CH-UA HTTP headers that can provide you with additional information about the browser beyond what the HTTP User-Agent header provides. As mentioned, only Chrome and browsers derived from Chromium (or if you prefer, 'Blink') support these headers, and only since early 2021 (for Chrome; later for some others). However, Chrome is what a lot of people use. More to the point, Chrome is what a lot of bad crawlers claim to be in their User-Agent header. As has been written up by other people, you can use these headers to detect inconsistencies that give away crawlers.

In an ideal world, it would be enough to detect a recent enough Chrome version and then require it to be consistent between the User-Agent, the platform from Sec-CH-UA-Platform, and the version information from Sec-CH-UA. We don't live in an ideal world. The first issue is that some versions of Chrome don't send these user agent hints by default (I've seen this specifically from Android Pixel devices). To get them to do so, you must reply with a HTTP 307 redirection that includes Accept-CH and Critical-CH headers for the Sec-CH-UA headers you care about. I'm not sure if you can redirect the browser to the current URL; I opt to redirect to the URL with a special query parameter added, which then redirects back to the original version of the URL.

(One advantage of this is that in my HTTP request handling, I can reject a request with the special query parameter if it still doesn't including the Sec-CH-UA headers I ask for. This avoids infinite redirect loops and lets me log definite failures. Chrome browser setups that refuse to provide them even when requested are currently redirected to an error page explaining the situation.)

Cross checking the browser version from Sec-CH-UA against the 'browser version' in the User-Agent is complicated by the question of what is a browser version. This is especially the case because the 'brand names' used in Sec-CH-UA aren't necessarily the '<whatever>/<ver>' names used in the User-Agent; for example, Microsoft Edge will report itself as 'Microsoft Edge' in Sec-CH-UA but 'Edg/' in the User-Agent. Some browsers based on Chrome will report a Chrome version that is the same as their brand name version (this appears to be true for Edge, for example), but others definitely won't, so you may need a mapping table from brand name to User-Agent name if you want to go that far. Sometimes the best you can do is verify the claimed 'Chromium' version against the 'Chrome/' version from the User-Agent.

Platform names definitely require a mapping from the Sec-CH-UA-Platform value to what appears in the User-Agent. On top of that, sometimes browsers will change their User-Agent platform name without changing Sec-CH-UA-Platform. One case I know of is that some versions of Android Opera (and perhaps Chrome) will change their User-Agent to say they're on Linux if you have them ask for the 'desktop' version of a site, but still report the Android values in their Sec-CH-UA headers (and say that they aren't a mobile device in Sec-CH-UA-Mobile, which is fair enough). It's hard to object to this behavior in a world where User-Agent sniffing is one way that websites decide on regular versus 'mobile' versions.

My use of Sec-CH-UA checks so far here on Wandering Thoughts has turned up several sorts of bad behavior in crawlers (so far). As I sort of expected, the most common behavior is crawlers that claim to be Chrome in their User-Agent (or something derived from it) but don't supply any Sec-CH-UA headers; this is now a straightforward bad idea even if you mention your crawler in your User-Agent. Some crawlers report one Chrome version in Sec-CH-UA but another one in their User-Agent, usually with the User-Agent version being older. I suspect that these crawlers are based on Chromium and periodically update their Chromium version, but statically configure their User-Agent and don't update it. Some of these crawlers also report a different platform between Sec-CH-UA-Platform and their User-Agent (so far all of them have been running on macOS but saying they were Windows 10 or 11 machines in their User-Agent). The third case is things that report they are headless Chrome in their Sec-CH-UA header (and I reject them).

(This is where the Internet Archive gets a dishonorable mention; currently their crawling often has mismatched User-Agent and Sec-CH-UA headers. Sometimes they have a special marker in the User-Agent and sometimes it's just mismatched Chrome information.)

I've also seen some weird cases so far where a crawler provided Sec-CH-UA headers despite claiming to be Firefox in its User-Agent. My data so far is incomplete, but some of these have had mismatches between Sec-CH-UA-Platform and the User-Agent, while another claimed to be Chrome 88 (which in theory is before Chrome supported them) while saying it was Firefox 120 in its User-Agent. I've improved my logging and error reporting so I may get slightly better data on this in a while.

At the same time, checking Sec-CH-UA headers (and checking them against User-Agent headers) will definitely not defeat all bad crawlers. Some crawlers are clearly using either real browsers or software that fakes everything together properly. I suspect the latter because the most recent case involves a horde of IPs claiming to be Chrome 142 on macOS 10.15.7, which I doubt is so universal a configuration (especially on datacenter VPSes and servers). As with email spam, all of this is a constant race of heuristics against the bad actors.

(It's hard to judge my new Sec-CH-UA checks compared to my existing header checks because of check ordering. If I was sufficiently energetic I'd try to do all of the checks before rejecting anything and log all failed checks, but as it is I do checks one by one and reject (or redirect with Critical-CH) at the first failed one.)

Browser version numbers are a bit complicated (for server code)

Chris's Wiki :: blog

By: cks

22 December 2025 at 04:15

Suppose, not entirely hypothetically, that you're writing code that for some reason wants to determine a 'browser version' from something and then cross-check it against other sources of browser version information. Possibly you also want to notice when you're not working with real browsers and not apply your version consistency checks to them. When you're starting out, it looks like what your code should do is return a browser name and version number. Unfortunately, this is a naive view, partly because of all of the browsers based on Chrome (or Chromium) and partly because of mobile device WebViews, which reuse a browser engine without being the browser.

The theoretically correct and maximally flexible approach would be to parse all possible version indicators of everything from whatever source of information you're using, such as the browser User-Agent or user agent client hints, and return them as a big map, possibly augmented with your best guess at what the 'browser' as such is. If applied to a User-Agent string such as this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36 OPR/125.0.0.0

Parsing this might give you identifiers and versions of AppleWebKit 537.36, Chrome 141, Safari 537.36, and OPR 125, and you'd guess that the browser is Opera and it's based on Chromium 141 (which is potentially important for what features and behavior should be present). There are complications in parsing this, because sometimes you'll see "Mobile Safari/537.36", and sometimes you'll see mysterious additions like 'Version/4.0' or 'ABB/133.0.6943.51' (and I haven't even gone into what you might see on iOS). Simply fully parsing the User-Agent string is complicated (although there are projects that do this for you, such as the User Agent String Parser and the Python user-agents package).

(For instance, did you know that Firefox reports its Gecko version in at least two ways? On desktop Firefox, it's always 'Gecko/20100101'. On Android Firefox, it can be 'Gecko/146.0', perhaps always matching the Firefox/ version.)

One problem is that a giant map is not necessarily entirely useful to code that wants to use browser version information, especially since the browser names in data may not match the common names you know them by. For example, on iOS devices Firefox reports 'FxiOS' and Chrome reports 'CriOS', which is in one sense accurate because these two iOS browsers don't have the behavior of their regular counterparts since they're built on top of Apple's WebKit, not their own browser engines (and as a result Chrome on iOS doesn't report user agent client hints). Do you want to treat FxiOS as a different browser from Firefox or not? That depends.

Currently, the minimum information I think you want to provide is the name and version of both the browser engine and the 'browser' itself. Given WebViews, Chromium, and other similar situations, you may not be able to reliably determine the browser, and sometimes you won't have either. When parsing the User-Agent string for Chrome, you don't get an explicit version for Chromium, so you have to assume it's the same as the Chrome version; for Chrome derived browsers I think you can assume that the 'Chrome/...' version reported is the version of their underlying Chromium. If present, the HTTP Sec-CH-UA header can give you the Chromium version directly and also perhaps tell you if you have a genuine Chrome or another brand where you (or your User-Agent parser) don't recognize their User-Agent marker.

It's now a bad idea to look like a browser in your HTTP User-Agent

Chris's Wiki :: blog

By: cks

20 December 2025 at 03:34

Once upon a time, something like the following was a perfectly decent User-Agent header string for a web crawler or a web fetching agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36 (compatible; Yourbot; +https://some/url)

You weren't hiding, after all, you called yourself 'Yourbot', and for the rest, you were asking for people to serve you pages like you were Chrome. Well, I'm not too sad to say, those days are over.

They're over because an increasing number of websites are increasingly requiring that anything that looks like a browser in its User-Agent also act like a browser, in specific the browser and browser version it's saying it is, and there are a lot of picky details around other HTTP headers (also). For example, often simply having 'Mozilla' in your User-Agent will cause Anubis to challenge your crawler (cf). And the version of Chrome being asserted here is new enough that it should be reporting a Sec-CH-UA-Platform header, among other Sec-CH- headers.

(Claiming to be a really old version of Chrome without those features is likely to be worse.)

Now, you can certainly pin your hopes on the idea that people who are writing header checking code will pay attention to the presence of the 'compatible;' and the URL in your User-Agent, and realize that you're not actually a browser despite you having a fairly good imitation of a Chrome User-Agent. However, you're not Google(bot). People have to make exceptions for Googlebot (to some degree), but they don't have to make exceptions for you and they probably won't.

The User-Agent you should instead use today is something like, for example:

Fedithing/4.5.1 (library/1.2.3; +https://some/url)

You don't start with a superstitious invocation of 'Mozilla/5.0', you don't claim to be be like any version of any browser, and you put in the basics of identifying your software and yourself so no one can accuse you of hiding. No one is going to match your User-Agent against detectors for old versions of browsers, or things claiming to be browser but lacking their headers, and so on, because you haven't put in the names of any browsers.

PS: Googlebot and Bingbot and a few others still use User-Agent strings very much like my first example, but they're Googlebot (and Bingbot) and to a fair extent they do get their HTTP headers relatively authentic.

Fake "web browsers" and their (lack of) HTTP headers: some notes

Chris's Wiki :: blog

By: cks

18 December 2025 at 03:34

It's hopefully not news to people that there is a plague of disguised web crawlers that are imitating web browsers (and not infrequently crawling from residential IPs, through various extremely questionable methods). However, many of these crawlers have only a skin-deep imitation of browsers, primarily done through their HTTP User-Agent header. This creates a situation where some of these crawlers can currently be detected (and blocked) because they either lack entirely or have non-browser values for other HTTP headers. I've been engaged in a little campaign to reduce the crawler presence here on Wandering Thoughts, so I've been experimenting with a number of HTTP header checks.

Headers I'm currently looking at include:

The CF-Worker header is set for all requests from Cloudflare Workers. Anubis blocks all requests with this header set by default (cf), and I decided to copy it. This occasionally blocks things trying to scrape Wandering Thoughts.
As I discovered, you can't block requests with X-Forwarded-For headers because people really do set these headers on real, non-malicious requests.
The Sec-Fetch-Mode header is sent by every modern browser and is sent by almost no bad crawlers. However, checking things claiming to be Safari is a little bit complicated, since Sec-Fetch-Mode support was only added in early 2023 (in 16.4) and there are still older Safari versions out there (including earlier 16.x versions). This is a quite effective check in my environment.
(I got this trick from here, although apparently there may be trouble with mobile WebView interfaces, which might come about through in-app navigation if someone sends a URL around.)
Every mainstream browser sends an Accept-Encoding header and has for a long time. If it's missing for a fetch of a regular HTML page, you have an imposter. Unless you like maintaining a list of old browsers and other programs that don't send Accept-Encoding, you probably want to limit requiring the header to things claiming to be at least a bit like mainstream browsers.
Some bad bots are sending an Accept-Encoding of 'identity' in what is apparently an attempt to avoid being fed compression bombs by people (I can't find my source for this). No mainstream browser should do this and in general most things fetching web pages from you should accept compressed responses if they advertise an Accept-Encoding at all.
Sadly, the exception to this is syndication feed fetchers, some of which refuse to do compression. Whether you keep supporting such feed fetchers is up to you. Wandering Thoughts still does so far, although it's getting tempting to say that enough is enough, especially with the size of syndication feeds here.
Some or perhaps many bad crawlers set a HTTP Accept header of '*/*' on HTML requests, which isn't something that real browsers do (source). Unfortunately, browser-based syndication feed fetchers will send this value, so you can only do this check on HTML pages, and also bingbot and Googlebot (at least) will sometimes also send this Accept value. Some things seem to not end an Accept header at all, too.
Based on monitoring the results so far, there may be something funny going on; I've seen the same IP and User-Agent making an initial request that is fine and then one or more re-requests for the same URL that have 'Accept: */*' and fail
A number of bad crawlers make HTTP/1.0 requests while claiming to be mainstream browsers, all of which have supported HTTP/1.1 for a very long time, and these days I block such requests. Although it's tempting to reject all HTTP/1.0 requests, some text-mode browsers still make them (the ones I know of are Lynx and w3m, including inside GNU Emacs). The HTTP version isn't really a HTTP header, but close enough.

Some of these checks overlap with each other. For example, the crawler with a bad Accept: HTTP header wasn't sending Sec-Fetch-Mode either.

Many of these HTTP headers are only sent by relatively mainstream browsers and environments that have added support for recent HTTP headers. For example, people still use text-based browsers and most of them don't send headers like Sec-Fetch-Mode; other programs that make HTTP requests through various packages and libraries probably won't either.

There are probably other useful header differences between crawlers imitating mainstream browsers and actual browsers (and, apparently, between headless browsers being driven by automation and real ones being used by people). You could probably discover some of them by collecting enough of a data set of request headers and then doing some sort of statistical analysis to discover correlations and clusters.

PS: The big offenders for requesting uncompressed syndication feeds appear to be Tiny Tiny RSS, Selfoss, and Nextcloud-News. Some browser based syndication feed readers also appear to do it, as do some curl-based syndication feed fetching that people are doing here.

Sidebar: What is a (mainstream) browser-like User-Agent?

It depends on how restrictive you want to be. There are a lot of options:

Just look for "Mozilla/5.0 (" at the start of the User-Agent.
Also look for " Chrome/", " Firefox/", or " AppleWebKit/" in the User-Agent
Try to specifically match a Firefox or Webkit based browser User-Agent format, which will cause you to learn a lot about what Webkit-based user agents appear in your logs.
Potentially exclude things that mark themselves as robots or crawlers, for example by having 'compatible;' in their User-Agent, or 'robot', or a URL. Anything with these markers is not trying to exactly be a browser User-Agent, although they may be looking generally like one.

I use different versions of these for different checks in DWiki's steadily growing pile of hacks to detect bad crawlers. Currently the most specific matching is reserved for blocking claimed browsers from cloud/server space, which catches a significant amount even with a limited selection of cloud and VPS provider space that it applies to.

(Some cloud space is blocked entirely; blocking only things that claim to be browsers is a lesser step.)

(2 comments.)

Self-hosting your Mastodon media with SeaweedFS

IT+Notes

By: Stefano Marinelli

6 November 2025 at 11:30

A practical guide to boosting Mastodon performance by self-hosting your media with SeaweedFS. Configure a fast, S3-compatible storage backend to efficiently handle your instance's files and take full control of your data.

Web Development Tip: Disable Pointer Events on Link Images

Pixel Envy

By: Nick Heer

27 November 2025 at 04:45

Good tip from Jeff Johnson:

My business website has a number of “Download on the App Store” links for my App Store apps. Here’s an example of what that looks like:

[…]

The problem is that Live Text, “Select text in images to copy or take action,” is enabled by default on iOS devices (Settings → General → Language & Region), which can interfere with the contextual menu in Safari. Pressing down on the above link may select the text inside the image instead of selecting the link URL.

I love the Live Text feature, but it often conflicts with graphics like these. There is a good, simple, two-line CSS trick for web developers that should cover most situations. Also, if you rock a user stylesheet — and I think you should — it seems to work fine as a universal solution. Any issues I have found have been minor and not worth noting. I say give it a shot.

Update: Adding Johnson’s CSS to a user stylesheet mucks up the layout of Techmeme a little bit. You can exclude it by adding div:not(.ii) > before a:has(> img) { display: inline-block; }.

⌥ Permalink

Do you care about (all) HTTP requests from cloud provider IP address space?

Chris's Wiki :: blog

By: cks

29 November 2025 at 04:21

About a month ago Mike Hoye wrote Raised Shields, in which Hoye said, about defending small websites from crawler abuse in this day and age:

If you only care about humans I strongly advise you to block every cloudhost subnet you can find, pretty easy given the effort they put into finding you. Most of the worst actors out there are living comfortably on Azure, GCP, Yandex and sometimes Huawei’s servers.

(As usual, there's no point in complaining about abusive crawlers to the cloud providers.)

I've said something similar on the Fediverse:

Today's idle thought: how many small web servers actually have any reason to accept requests from AWS or Google Cloud IP address space? If you search through your logs with (eg) grepcidr, you may find that there's little or nothing of value coming from there, and they sure are popular with LLM crawlers these days.

You definitely want to search your logs before doing this, and you may find that you want to make some exceptions even if you do opt for it. For example, you might want or need to let cloud-hosted things fetch your syndication feeds, because there are a fair number of people and feed readers that do their fetching from the cloud. Possibly you'll find that you have a significant number of real visitors that are using do it yourself personal VPN setups that have cloud exit points.

(How many exceptions you want to make may depend on how much of a hard line you want to take. I suspect that Mike Hoye's line is much harder than mine.)

However, I think that for a lot of small, personal web servers and web sites you'll find that almost nothing of genuine value comes from the big cloud provider networks, from AWS, Google Cloud, Azure, Oracle, and so on. You're probably not getting real visitors from these clouds, people who are interested in reading your work and engaging with it. Instead you'll most likely see an ever-growing horde of obvious crawlers, increasingly suspicious user agents, claims to be things that they aren't, and so on.

On the one hand, it's in some sense morally pure to not block these cloud areas unless they're causing your site active harm; it's certainly what the ethos was on the older Internet, and it was a good and useful ethos for those times. On the other hand, that view is part of what got us here. More and more, these days are the days of Raised Shields, as we react to the new environment (much as email had to react to the new environment of ever increasing spam).

If you're doing this, one useful trick you can play if you have the right web server environment is to do your blocking with HTTP 429 Too Many Requests responses. Using this HTTP code is in some sense inaccurate, but it has the useful effect that very few things will take it as a permanent error the way they may take, for example, HTTP 403 (or HTTP 404). This gives you a chance to monitor your web server logs and add a suitable exemption for traffic that you turn out to want after all, without your error responses doing anything permanent (like potentially removing your pages from search engine indexes). You can also arrange to serve up a custom error page for this case, with an explanation or a link to an explanation.

(My view is that serving a 400-series HTTP error response is better than a HTTP 302 temporary redirect to your explanation, for various reasons. Possibly there are clever things you can do with error pages in general.)

(6 comments.)

People are sending HTTP requests with X-Forwarded-For across the Internet

Chris's Wiki :: blog

By: cks

17 November 2025 at 03:49

Over on the Fediverse, I shared a discovery that came from turning over some rocks here on Wandering Thoughts:

This is my face when some people out there on the Internet send out HTTP requests with X-Forwarded-For headers, and maybe even not maliciously or lying. Take a bow, ZScaler.

The HTTP X-Forwarded-For header is something that I normally expect to see only on something behind a reverse proxy, where the reverse proxy frontend is using it to tell the backend the real originating IP (which is otherwise not available when the HTTP requests are forwarded with HTTP). As a corollary of this usage, if you're operating a reverse proxy frontend you want to remove or rename any X-Forwarded-For headers that you receive from the HTTP client, because it may be trying to fool your backend about who it is. You can use another X- header name for this purpose if you want, but using X-Forwarded-For has the advantage that it's a de-facto standard and so random reverse proxy aware software is likely to have an option to look at X-Forwarded-For.

(See, for example, the security and privacy concerns section of the MDN page.)

Wandering Thoughts doesn't run behind a reverse proxy, and so I assume that I wouldn't see X-Forwarded-For headers if I looked for them. More exactly I assumed that I could take the presence of an X-Forwarded-For header as an indication of a bad request. As I found out, this doesn't seem to be the case; one source of apparently legitimate traffic to Wandering Thoughts appears to attach what are probably legitimate X-Forwarded-For headers to requests going through it. I believe this particular place operates partly as a (forward) HTTP proxy; if they aren't making up the X-Forwarded-For IP addresses, they're willing to leak the origin IPs of people using them to third parties.

All of this makes me more curious than usual to know what HTTP headers and header values show up on requests to Wandering Thoughts. But not curious enough to stick in logging, because that would be quite verbose unless I could narrow things down to only some requests. Possibly I should stick in logging that can be quickly turned on and off, so I can dump header information only briefly.

(These days I've periodically wound up in a mood to hack on DWiki, the underlying engine behind Wandering Thoughts. It reminds me that I enjoy programming.)

(5 comments.)

Getting feedback as a small web crawler operator

Chris's Wiki :: blog

By: cks

13 November 2025 at 04:17

Suppose, hypothetically, that you're trying to set up a small web crawler for a good purpose. These days you might be focused on web search for text focused sites, or small human written sites, or similar things, and certainly given the bad things that are happening with the major crawlers we could use them. As a small crawler, you might want to get feedback and problem reports from web site operators about what your crawler is doing (or not doing). As it happens, I have some advice and views on this.

Above all, remember that you are not Google or even Bing. Web site operators need Google to crawl them, and they have no choice but to bend over backward for Google and to send out plaintive signals into the void if Googlebot is doing something undesirable. Since you're not Google and you need websites much more than they need you, the simplest thing for website operators to do with and about your crawler is to ignore the issue, potentially block you if you're causing problems, and move on.
You cannot expect people to routinely reach out to you. Anyone who does reach out to you is axiomatically doing you a favour, at the expense of some amount of their limited time and at some risk to themselves.
Website operators have no reason to trust you or trust that problem reports will be well received. This is a lesson plenty of people have painfully learned from reporting spam (email or otherwise) and other abuse; a lot of the time your reports can wind up in the hands of people who aren't well intentioned toward you (either going directly to them or 'helpfully' being passed on by the ISP). At best you confirm that your email address is alive and get added to more spam address lists; at worst you get abused in various ways.
The consequence of this is that if you want to get feedback, you should make it as low-risk as possible for people. The lowest risk way (to website operators) is for you to have a feedback form on your site that doesn't require email or other contact methods. If you require that website operators reveal their email addresses, social media handles, or whatever, you will get much less feedback (this includes VCS forge handles if you force them to make issue reports on some VCS forge).
(This feedback form should be easy to find, for example being directly linked from the web crawler information URL in your User-Agent.)
As far as feedback goes, both your intentions and your views on the reasonableness of what your web crawler is doing (and how someone's website behaves) are irrelevant. What matters is the views of website operators, who are generally doing you a favour by not simply blocking or ignoring your crawler and moving on. If you disagree with their feedback, the best thing to do is be quiet (and maybe say something neutral if they ask for a reply). This is probably most important if your feedback happens through a public VCS forge issue tracker, where future people who are thinking about filing an issue the way you asked may skim over past issues to see how they went.
(You may or may not ignore website operator feedback that you disagree with depending on how much you want to crawl (all of) their site.)

At the moment, most website operators who notice a previously unknown crawler will likely assume that it's an (abusive) LLM crawler. One way to lower the chances of this is to follow social conventions around crawlers for things like crawler User-Agents and not setting the Referer header. I don't think you have to completely imitate how Googlebot, bingbot, Applebot, the archive.org bot and so on format their User-Agent strings, but it's going to help to generally look like them and clearly put the same sort of information into yours. Similarly, if you can it will help to crawl from clearly identified IPs with reverse DNS. The more that people think you're legitimate and honest, the more likely they are to spend the time and take the risk to give you feedback; the more sketchy or even uncertain you look, the less likely you are to get feedback.

(In general, any time you make website operators uncertain about an aspect of your web crawler, some number of them will not be charitable in their guess. The more explicit and unambiguous you are in the more places, the better.)

Building and running a web crawler is not an easy thing on today's web. It requires both technical knowledge of various details of HTTP and how you're supposed to react to things (eg), and current social knowledge of what is customary and expected of web crawlers, as well as what you may need to avoid (for example, you may not want to start your User-Agent with 'Mozilla/5.0' any more, and in general the whole anti-crawling area is rapidly changing and evolving right now). Many website operators revisit blocks and other reactions to 'bad' web crawlers only infrequently, so you may only get one chance to get things right. This expertise can't be outsourced to a random web crawling library because many of them don't have it either.

(While this entry was sparked by a conversation I had on the Fediverse, I want to be explicit that it is in no way intended as a subtoot of that conversation. I just realized that I had some general views that didn't fit within the margins of Fediverse posts.)

(One comment.)

Firefox's sudden weird font choice and fixing it

Chris's Wiki :: blog

By: cks

12 November 2025 at 04:03

Today, while I was in the middle of using my normal browser instance, it decided to switch from DejaVu Sans to Noto Sans as my default font:

Dear Firefox: why are you using Noto Sans all of a sudden? I have you set to DejaVu Sans (and DejaVu everything), and fc-match 'sans' and fc-match serif both say they're DejaVu (and give the DejaVu TTF files). This is my angry face.

This is a quite noticeable change for me because it changes the font I see on Wandering Thoughts, my start page, and other things that don't set any sort of explicit font. I don't like how Noto Sans looks and I want DejaVu Sans.

(I found out that it was specifically Noto Sans that Firefox was using all of a sudden through the Web Developer tools 'Font' information, and confirmed that Firefox should still be using DejaVu through the way to see this in Settings.)

After some flailing around, it appears that what I needed to do to fix this was explicitly set about:config's font.name.serif.x-western, font.name.sans-serif.x-western, and font.name.monospace.x-western to specific values instead of leaving them set to nothing, which seems to have caused Firefox to arrive on Noto Sans through some mysterious process (since the generic system font name 'sans' was still mapping to DejaVu Sans). I don't know if these are exposed through the Fonts advanced options in Settings → General, which are (still) confusing in general. It's possible that these are what are used for 'Latin'.

(I used to be using the default 'sans', 'serif', and 'monospace' font names that cascaded through to the DejaVu family. Now I've specifically set everything to the DejaVu set, because if something in Fedora or Firefox decides that the default mapping should be different, I don't want Firefox to follow it, I want it to stay with DejaVu.)

I don't know why Firefox would suddenly decide these pages are 'western' instead of 'unicode'; all of them are served as or labeled as UTF-8, and nothing about that has changed recently. Unfortunately, as far as I know there's no way to get Firefox to tell you what font.name preference name it used to pick (default) fonts for a HTML document. When it sends HTTP 304 Not Modified responses, Wandering Thoughts doesn't include a Content-Type header (with the UTF-8 character set), but as far as I know that's a standard behavior and browsers presumably cope with it.

(Firefox does see 'Noto Sans' as a system UI font, which it uses on things like HTML form buttons, so it didn't come from nowhere.)

It makes me sad that Firefox continues to have no global default font choice. You can set 'Unicode' but as I've just seen, this doesn't make what you set there the default for unset font preferences, and the only way to find out what unset font preferences you have is to inspect about:config.

PS: For people who aren't aware of this, it's possible for Firefox to forget some of your about:config preferences. Working around this probably requires using Firefox policies (via), which can force-set arbitrary about:config preferences (among other things).

(2 comments.)

A HTTP User-Agent that claims to be Googlebot is now a bad idea

Chris's Wiki :: blog

By: cks

9 November 2025 at 04:04

Once upon a time, people seem to have had a little thing for mentioning Googlebot in their HTTP User-Agent header, much like browsers threw in claims to make them look like Firefox or whatever (the ultimate source of the now-ritual 'Mozilla/5.0' at the start of almost every browser's User-Agent). People might put in 'allow like Googlebot' or just say 'Googlebot' in their User-Agent. Some people are still doing this today, for example:

Gwene/1.0 (The gwene.org rss-to-news gateway) Googlebot

This is now an increasingly bad idea on the web and if you're doing it, you should stop. The problem is that there are various malicious crawlers out there claiming to be Googlebot, and Google publishes their crawler IP address ranges. Anything claiming to be Googlebot that is not from a listed Google IP is extremely suspicious and in this day and age of increasing anti-crawler defenses, blocking all 'Googlebot' activity that isn't from one of their listed IP ranges is an obvious thing to do. Web sites may go even further and immediately taint the IP address or IP address range involved in impersonating Googlebot, blocking or degrading further requests regardless of the User-Agent.

(Gwene is not exactly claiming to be Googlebot but they're trying to get simple Googlebot-recognizers to match them against Googlebot allowances. This is questionable at best. These days such attempts may do more harm than good as they get swept up in precautions against Googlebot forgery, or rules that block Googlebot from things it shouldn't be fetching, like syndication feeds.)

A similar thing applies to bingbot and the User-Agent of any other prominent web search engines, and Bing does publish their IP address ranges. However, I don't think I've ever seen someone impersonate bingbot (which probably doesn't surprise anyone). I don't know if anyone ever impersonates Archive.org (no one has in the past week here), but it's possible that crawler operators will fish to see if people give special allowances to them that can be exploited.

(The corollary of this is that if you have a website, an extremely good signal of bad stuff is someone impersonating Googlebot and maybe you could easily block that. I think this would be fairly easy to do in an Apache <If> clause that then Allow's from Googlebot's listed IP addresses and Denies everything else, but I haven't actually tested it.)

(One comment.)

Trying to understand Firefox's approaches to tracking cookie isolation

Chris's Wiki :: blog

By: cks

2 November 2025 at 02:50

As I learned recently, modern versions of Firefox have two different techniques that try to defeat (unknown) tracking cookies. As covered in the browser addon JavaScript API documentation, in Tracking protection, these are called first-party isolation and dynamic partitioning (or storage partitioning, the documentation seems to use both). Of these two, first party isolation is the easier to describe and understand. To quote the documentation:

When first-party isolation is on, cookies are qualified by the domain of the original page the user visited (essentially, the domain shown to the user in the URL bar, also known as the "first-party domain").

(In practice, this appears to be the top level domain of the site, not necessarily the site's domain itself. For example, Cookie Manager reports that a cookie set from '<...>.cs.toronto.edu' has the first party domain 'toronto.edu'.)

Storage partitioning is harder to understand, and again I'll quote the Storage partitioning section of the cookie API documentation:

When using dynamic partitioning, Firefox partitions the storage accessible to JavaScript APIs by top-level site while providing appropriate access to unpartitioned storage to enable common use cases. [...]

Generally, top-level documents are in unpartitioned storage, while third-party iframes are in partitioned storage. If a partition key cannot be determined, the default (unpartitioned storage) is used. [...]

If you read non-technical writeups like Firefox rolling out Total Cookie Protection (from 2022), it certainly sounds like they're describing first-party isolation. However, if you check things like Status of partitioning in Firefox and the cookies API documentation on first-party isolation, as far as I can tell what Firefox actually normally uses for "Total Cookie Protection" is storage partitioning.

Based on what I can decode from the two descriptions and from the fact that Tor Browser defaults to first-party isolation, it appears that first-party isolation is better and stricter than storage partitioning. Presumably it also causes problems on more websites, enough so that Firefox either no longer uses it for Total Cookie Protection or never did, despite their description sounding like first-party isolation.

(So far I haven't run into any issues with first-party isolation in my cookie-heavy browser environment. It's possible that websites have switched how they do things to avoid problems.)

First-party isolation can be enabled in about:config by setting privacy.firstparty.isolate to true. If and when you do this, the normal Settings → Privacy and Security will show a warning banner at the top to the effect of:

You are using First Party Isolation (FPI), which overrides some of Firefox’s cookie settings.

All of this is relevant to me because one of my add-ons, Cookie AutoDelete, probably works with first-party isolation but almost certainly doesn't work with storage isolation (ie, it will fail to delete some cookies under storage isolation, although I believe it can still delete unpartitioned cookies). Given what I've learned, I'm likely to turn on first-party isolation in my main browser environment soon.

If Cookie Manager is reporting correct information to me, it's possible to have cookies that are both first-party isolated and partitioned; the one I've seen so far is from Youtube. Cookie Manager can't seem to remove these cookies. Based on what I've read about (storage or dynamic) partitioned cookies, I suspect that these are created by embedded iframes.

(Turning on or off first-party isolation effectively drops all of the cookies you currently have, so it's probably best to do it when you restart your browser.)

(9 comments.)

Firefox, the Cookie AutoDelete add-on, and "Total Cookie Protection"

Chris's Wiki :: blog

By: cks

31 October 2025 at 03:15

In a comment on my entry on flailing around with Firefox's Multi-Account Containers, Ian Z aka nobrowser asked a good question:

The Cookie Autodelete instructions with respect to Total Cookie Protection mode are very confusing. Reading them makes me think this extension is not for me, as I have Strict Mode on in all windows, private or not. [...]

This is an interesting question (and, it turns out, relevant to my usage too) so I did some digging. The short answer is that I suspect the warning on Cookie AutoDelete's add-on page is out of date and it works fine. The long answer starts with the history of HTTP cookies.

Back in the old days, HTTP cookies were global, which is to say that browsers kept a global pool of HTTP cookies (both first party, from the website you were on, and third-party cookies), and it would send any appropriate cookie on any HTTP request to its site. This enabled third-party tracking cookies and a certain amount of CSRF attacks, since the browser would happily send your login cookies along with that request initiated by the JavaScript on some sketchy website you'd accidentally wound up on (or JavaScript injected through an ad network).

This was obviously less than ideal and people wound up working to limit the scope of HTTP cookies, starting with things like Firefox's containers and eventually escalating to first-party cookie isolation, where a cookie is restricted to whatever the first-party domain was when it was set. If you're browsing example.org and the page loads google.com/tracker, which sets a tracker cookie, that cookie will not be sent when you browse example.com and the page also loads google.com/tracker; the first tracking cookie is isolated to example.org.

(There is also storage isolation for cookies, but I think that's been displaced by first-party cookie isolation.)

However, first-party isolation has the possibility to break things you expect to work, as covered in this Firefox FAQ). As a result of this, my impression is that browsers have been cautious and slow to roll out first-party isolation by default. However, they have made it available as an option or part of an option. Firefox calls this Total Cookie Protection (also, also).

(Firefox is working to go even further, blocking all third-party cookies.)

Firefox add-ons have special APIs that allow them to do privileged things, and these include an API for dealing with cookies. When first-party cookie isolation came to pass, these APIs needed to be updated to deal with such isolated cookies (and cookie tracking protection in general). For instance, cookies.remove() has to be passed a special parameter to remove a first-party isolated cookie. As covered in the documentation, an add-on using the cookies APIs without the necessary updates would only see non-isolated cookies, if there were any. So at the time the message on Cookie AutoDelete's add-on page was written, I suspect that it hadn't been updated for first-party isolation. However, based on checking the source code of Cookie AutoDelete, I believe that it currently supports first-party isolation for cookies, and in fact may have done so for some time, perhaps v3.5.0, or v3.4.0 or even earlier.

(It's also possible that this support is incomplete or buggy, or that there are still some things that you can't easily do through it that matter to Cookie AutoDelete.)

Cookie AutoDelete itself is potentially useful even if you have Firefox set to block all third-party cookies, because it will also clean up unwanted first-party cookies (assuming that it truly works with first-party isolation). Part of my uncertainly is that I'm not sure how you reliably find out what cookies you have in a browser world with first-party isolation. There's theoretically some information about this in Settings → Privacy & Security → Cookies and Site Data → "Manage Data...", but since that's part of the normal Settings UI that normal people use, I'm not sure if it's simplifying things.

PS: Now that I've discovered all of this, I'm not certain if my standard Cookie Quick Manager add-on properly supports first-party isolated cookies. There's this comment on an issue that suggests it does support first-party isolation but not storage partitioning (also). The available Firefox documentation and Settings UI is not entirely clear about whether first-party isolation is now on more or less by default.

(That comment points to Cookie Manager as a potential partition-aware cookie manager.)

(2 comments.)

My flailing around with Firefox's Multi-Account Containers

Chris's Wiki :: blog

By: cks

30 October 2025 at 02:43

I have two separate Firefox environments. One of them is quite locked down so that it blocks JavaScript by default, doesn't accept cookies, and so on. Naturally this breaks a lot of things, so I have a second "just make it work" environment that runs all the JavaScript, accepts all the cookies, and so on (although of course I use uBlock Origin, I'm not crazy). This second environment is pretty risky in the sense that it's going to be heavily contaminated with tracking cookies and so on, so to mitigate the risk (and make it a better environment to test things in), I have this Firefox set to discard cookies, caches, local storage, history, and so on when it shuts down.

In theory how I use this Firefox is that I start it when I need to use some annoying site I want to just work, use the site briefly, and then close it down, flushing away all of the cookies and so on. In practice I've drifted into having a number of websites more or less constantly active in this "accept everything" Firefox, which means that I often keep it running all day (or longer at home) and all of those cookies stick around. This is less than ideal, and is a big reason why I wish Firefox had a 'open this site in a specific profile' feature. Yesterday, spurred on by Ben Zanin's Fediverse comment, I decided to make my "accept everything" Firefox environment more complicated in the pursuit of doing better (ie, throwing away at least some cookies more often).

First, I set up a combination of Multi-Account Containers for the basic multi-container support and FoxyTab to assign wildcarded domains to specific containers. My reason to use Multi-Account Containers and to confine specific domains to specific containers is that both M-A C itself and my standard Cookie Quick Manager add-on can purge all of the cookies and so on for a specific container. In theory this lets me manually purge undesired cookies, or all cookies except desired ones (for example, my active Fediverse login). Of course I'm not likely to routinely manually delete cookies, so I also installed Cookie AutoDelete with a relatively long timeout and with its container awareness turned on, and exemptions configured for the (container-confined) sites that I'm going to want to retain cookies from even when I've closed their tab.

(It would be great if Cookie AutoDelete supported different cookie timeouts for different containers. I suspect it's technically possible, along with other container-aware cookie deletion, since Cookie AutoDelete applies different retention policies in different containers.)

In FoxyTab, I've set a number of my containers to 'Limit to Designated Sites'; for example, my 'Fediverse' container is set this way. The intention is that when I click on an external link in a post while reading my Fediverse feed, any cookies that external site sets don't wind up in the Fediverse container; instead they go either in the default 'no container' environment or in any specific container I've set up for them. As part of this I've created a 'Cookie Dump' container that I've assigned as the container for various news sites and so on where I actively want a convenient way to discard all their cookies and data (which is available through Multi-Account Containers).

Of course if you look carefully, much of this doesn't really require Multi-Account Containers and FoxyTab (or containers at all). Instead I could get almost all of this just by using Cookie AutoDelete to clean out cookies from closed sites after a suitable delay. Containers do give me a bit more isolation between the different things I'm using my "just make it work" Firefox for, and maybe that's important enough to justify the complexity.

(I still have this Firefox set to discard everything when it exits. This means that I have to re-log-in every so often even for the sites where I have Cookie AutoDelete keep cookies, but that's fine.)

(3 comments.)

I wish Firefox Profiles supported assigning websites to profiles

Chris's Wiki :: blog

By: cks

29 October 2025 at 03:23

One of the things that Firefox is working on these days is improving Firefox's profiles feature so that it's easier to use them. Firefox also has an existing feature that is similar to profiles, in containers and the Multi-Account Containers extension. The reason Firefox is tuning up profiles is that containers only separate some things, while profiles separate pretty much everything. A profile has a separate set of about:config settings, add-ons, add-on settings, memorized logins, and so on. I deliberately use profiles to create two separate and rather different Firefox environments. I'd like to have at least two or three more profiles, but one reason I've been lazy is that the more profiles I have, the more complex getting URLs into the right profile is (even with tooling to help).

This leads me to my wish for profiles, which is for profiles to support the kind of 'assign website to profile' and 'open website in profile' features that you currently have with containers, especially with the Multi-Account Containers extension. Actually I would like a somewhat better version than Multi-Account Containers currently offers, because as far as I can see you can't currently say 'all subdomains under this domain should open in container X' and that's a feature I very much want for one of my use cases.

(Multi-Account Containers may be able to do wildcarded subdomains with an additional add-on, but on the other hand apparently it may have been neglected or abandoned by Mozilla.)

Another way to get much of what I want would be for some of my normal add-ons to be (more) container aware. I could get a lot of the benefit of profiles (although not all of them) by using Multi-Account Containers with container aware cookie management in, say, Cookie AutoDelete (which I believe does support that, although I haven't experimented). Using containers also has the advantage that I wouldn't have to maintain N identical copies of my configuration for core extensions and bookmarklets and so on.

(I'm not sure what you can copy from one profile to a new one, and you currently don't seem to get any assistance from Firefox for it, at least in the old profile interface. This is another reason I haven't gone wild on making new Firefox profiles.)

(4 comments.)

What little I want out of web "passkeys" in my environment

Chris's Wiki :: blog

By: cks

25 October 2025 at 03:19

WebAuthn is yet another attempt to do an API for web authentication that doesn't involve passwords but that instead allows browsers, hardware tokens, and so on to do things more securely. "Passkeys" (also) is the marketing term for a "WebAuthn credential", and an increasing number of websites really, really want you to use a passkey for authentication instead of any other form of multi-factor authentication (they may or may not still require your password).

Most everyone that wants you to use passkeys also wants you to specifically use highly secure ones. The theoretically most secure are physical hardware security keys, followed by passkeys that are stored and protected in secure enclaves in various ways by the operating system (provided that the necessary special purpose hardware is available). Of course the flipside of 'secure' is 'locked in', whether locked in to your specific hardware key (or keys, generally you'd better have backups) or locked in to a particular vendor's ecosystem because their devices are the only ones that can possibly use your encrypted passkey vault.

(WebAuthn neither requires nor standardizes passkey export and import operations, and obviously security keys are built to not let anyone export the cryptographic material from them, that's the point.)

I'm extremely not interested in the security versus availability tradeoff that passkeys make in favour of security. I care far more about preserving availability of access to my variety of online accounts than about nominal high security. So if I'm going to use passkeys at all, I have some requirements:

Linux people: is there a passkeys implementation that does not use physical hardware tokens (software only), is open source, works with Firefox, and allows credentials to be backed up and copied to other devices by hand, without going through some cloud service?

I don't think I'm asking for much, but this is what I consider the minimum for me actually using passkeys. I want to be 100% sure of never losing them because I have multiple backups and can use them on multiple machines.

Apparently KeePassXC more or less does what I want (when combined with its Firefox extension), and it can even export passkeys in a plain text format (well, JSON). However, I don't know if anything else can ingest those plain text passkeys, and I don't know if KeePassXC can be told to only do passkeys with the browser and not try to take over passwords.

(But at least a plain text JSON backup of your passkeys can be imported into another KeePassXC instance without having to try to move, copy, or synchronize a KeePassXC database.)

Normally I would ignore passkeys entirely, but an increasing number of websites are clearly going to require me to use some form of multi-factor authentication, no matter how stupid this is (cf), and some of them will probably require passkeys or at least make any non-passkey option very painful. And it's possible that reasonably integrated passkeys will be a better experience than TOTP MFA with my janky minimal setup.

(Of course KeePassXC also supports TOTP, and TOTP has an extremely obvious import process that everyone supports, and I believe KeePassXC will export TOTP secrets if you ask nicely.)

While KeePassXC is okay, what I would really like is for Firefox to support 'memorized passkeys' right along with its memorized passwords (and support some kind of export and import along with it). Should people use them? Perhaps not. But it would put that choice firmly in the hands of the people using Firefox, who could decide on how much security they did or didn't want, not in the hands of websites who want to force everyone to face a real risk of losing their account so that the website can conduct security theater.

(Firefox will never support passkeys this way for an assortment of reasons. At most it may someday directly use passkeys through whatever operating system services expose them, and maybe Linux will get a generic service that works the way I want it to. Nor is Firefox ever going to support 'memorized TOTP codes'.)

(3 comments.)

We need to start doing web blocking for non-technical reasons

Chris's Wiki :: blog

By: cks

20 October 2025 at 03:37

My sense is that for a long time, technical people (system administrators, programmers, and so on) have seen the web as something that should be open by default and by extension, a place where we should only block things for 'technical' reasons. Common technical reasons are a harmful volume of requests or clear evidence of malign intentions, such as probing for known vulnerabilities. Otherwise, if it wasn't harming your website and wasn't showing any intention to do so, you should let it pass. I've come to think that in the modern web this is a mistake, and we need to be willing to use blocking and other measures for 'non-technical' reasons.

The core problem is that the modern web seems to be fragile and is kept going in large part by a social consensus, not technical things such as capable software and powerful servers. However, if we only react to technical problems, there's very little that preserves and reinforces this social consensus, as we're busy seeing. With little to no consequences for violating the social consensus, bad actors are incentivized to skate right up to and even over the line of causing technical problems. When we react by taking only narrow technical measures, we tacitly reward the bad actors for their actions; they can always find another technical way. They have no incentive to be nice or to even vaguely respect the social consensus, because we don't punish them for it.

So I've come to feel that if something like the current web is to be preserved, we need to take action not merely when technical problems arise but also when the social consensus is violated. We need to start blocking things for what I called editorial reasons. When software or people do things that merely shows bad manners and doesn't yet cause us technical problems, we should still block it, either soft (temporarily, perhaps with HTTP 429 Too Many Requests) or hard (permanently). We need to take action to create the web that we want to see, or we aren't going to get it or keep it.

To put it another way, if we want to see good, well behaved browsers, feed readers, URL fetchers, crawlers, and so on, we have to create disincentives for ones that are merely bad (as opposed to actively damaging). In its own way, this is another example of the refutation of Postel's Law. If we accept random crap to be friendly, we get random crap (and the quality level will probably trend down over time).

To answer one potential criticism, it's true that in some sense, blocking and so on for social reasons is not good and is in some theoretical sense arguably harmful for the overall web ecology. On the other hand, the current unchecked situation itself is also deeply harmful for the overall web ecology and it's only going to get worse if we do nothing, with more and more things effectively driven off the open web. We only get to pick the poison here.

(9 comments.)

A Firefox issue and perhaps how handling scaling is hard

Chris's Wiki :: blog

By: cks

7 October 2025 at 03:09

Over on the Fediverse I shared a fun Firefox issue I've just run into:

Today's fun Firefox bug: if I move my (Nightly) Firefox window left and right across my X display, the text inside the window reflows to change its line wrapping back and forth. I have a HiDPI display with non-integer scaling and some other settings, so I'm assuming that Firefox is now suffering from rounding issues where the exact horizontal pixel position changes its idea of the CSS window width, triggering text reflows as it jumps back and forth by a CSS pixel.

(I've managed to reproduce this in a standard Nightly, although so far only with some of my settings.)

Close inspection says that this isn't quite what's happening, and the underlying problem is happening more often than I thought. What is actually happening is that as I move my Firefox window left and right, a thin vertical black line usually appears and disappears at the right edge of the window (past a scrollbar if there is one). Since I can see it on my HiDPI display, I suspect that this vertical line is at least two screen pixels wide. Under the right circumstances of window width, text size, and specific text content, this vertical black bar takes enough width away from the rest of the window to cause Firefox to re-flow and re-wrap text, creating easily visible changes as the window moves.

A variation of this happens when the vertical black bar isn't drawn but things on the right side of the toolbar and the URL bar area will shift left and right slightly as the window is moved horizontally. If the window is showing a scrollbar, the position of the scroll target in the scrollbar will move left and right, with the right side getting ever so slightly wider or returning back to being symmetrical. It's easiest to see this if I move the window sideways slowly, which is of course not something I do often (usually I move windows rapidly).

(This may be related to how X has a notion of sizing windows in non-pixel units if the window asks for it. Firefox in my configuration definitely asks for this; it asserts that it wants to be resized in units of 2 (display) pixels both horizontally and vertically. However, I can look at the state of a Firefox window in X and see that the window size in pixels doesn't change between the black bar appearing and disappearing.)

All of this is visible partly because under X and my window manager, windows can redisplay themselves even during an active move operation. If the window contents froze while I dragged windows around, I probably wouldn't have noticed this for some time. Text reflowing as I moved a Firefox window sideways created a quite attention-getting shimmer.

It's probably relevant that I need unusual HiDPI settings and I've also set Firefox's layout.css.devPixelsPerPx to 1.7 in about:config. That was part of why I initially assumed this was a scaling and rounding issue, and why I still suspect that area of Firefox a bit.

(I haven't filed this as a Firefox bug yet, partly because I just narrowed down what was happening in the process of writing this entry.)

Apache .htaccess files are important because they enable delegation

Chris's Wiki :: blog

By: cks

3 October 2025 at 03:03

Apache's .htaccess files have a generally bad reputation. For example, lots of people will tell you that they can cause performance problems and you should move everything from .htaccess files into your main Apache configuration, using various pieces of Apache syntax to restrict what configuration directives apply to. The result can even be clearer, since various things can be confusing in .htaccess files (eg rewrites and redirects). Despite all of this, .htaccess files are important and valuable because of one property, which is that they enable delegation of parts of your server configuration to other people.

The Apache .htaccess documentation even spells this out in reverse, in When (not) to use .htaccess files:

In general, you should only use .htaccess files when you don't have access to the main server configuration file. [...]

If you operate the server and would be writing the .htaccess file, you can put the contents of the .htaccess in the main server configuration and make your life easier and Apache faster (and you probably should). But if the web server and its configuration isn't managed as a unitary whole by one group, then .htaccess files allow the people managing the overall Apache configuration to safely delegate things to other people on a per-directory basis, using Unix ownership. This can both enable people to do additional things and reduce the amount of work the central people have to do, letting people things scale better.

(The other thing that .htaccess files allow is dynamic updates without having to restart or reload the whole server. In some contexts this can be useful or important, for example if the updates are automatically generated at unpredictable times.)

I don't think it's an accident that .htaccess files emerged in Apache, because one common environment Apache was initially used in was old fashioned multi-user Unix web servers where, for example, every person with a login on the web server might have their own UserDir directory hierarchy. Hence features like suEXEC, so you could let people run CGIs without those CGIs having to run as the web user (a dangerous thing), and also hence the attraction of .htaccess files. If you have a bunch of (graduate) students with their own web areas, you definitely don't want to let all of them edit your departmental web server's overall configuration.

(Apache doesn't solve all your problems here, at least not in a simple configuration; you're still left with the multiuser PHP problem. Our solution to this problem is somewhat brute force.)

These environments are uncommon today but they're not extinct, at least at universities like mine, and .htaccess files (and Apache's general flexibility) remain valuable to us.

(2 comments.)

Syndication feed fetchers, HTTP redirects, and conditional GET

Chris's Wiki :: blog

By: cks

29 September 2025 at 03:49

In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.

The specification compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make a conditional GET. This means you have to insure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued redirect target request. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.

(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)

Pragmatically, you can probably get away with re-doing the conditional GET when you get a temporary HTTP redirect for a feed, with the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:

You're receiving a generic HTTP redirection to some sort of error page that isn't a valid syndication feed. Your syndication feed fetcher isn't going to do anything with a successful fetch of it (except maybe add an 'error' marker to the feed), so a conditional GET that fools you with "nothing changed" is harmless.
You're being redirected to an alternate source of the normal feed, for example a feed that's normally dynamically generated might serve a (temporary) HTTP redirect to a static copy under high load. If the conditional GET matches the ETag (probably unlikely in practice) or the Last-Modified (more possible), then you almost certainly have the most current version and are fine, and you've saved the web server some load.
You're being (temporarily) redirected to some kind of error feed; a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).

The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).

A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed but the target of the HTTP redirection may be 'more recent' than it for various reasons (even if it's a valid feed; if it's not a valid feed then blindly saving its ETag and Last-Modified is probably quite dangerous). When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.

Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).

(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)

(3 comments.)

The HTTP Last-Modified value is specific to the URL (technically so is the ETag value)

Chris's Wiki :: blog

By: cks

28 September 2025 at 01:08

Last time around I wrote about how If-None-Match values (which come from ETag values) must come from the actual URL itself, not (for example) from another URL that you were at one point redirected to. In practice, this is only an issue of moderate concern for ETag/If-None-Match; you can usually make a conditional GET using an ETag from another URL and get away with it. This is very much an issue if you make the mistake of doing the same thing with an If-Modified-Since header based on another URL's Last-Modified header. This is because the Last-Modified header value isn't unique to a particular document, in a way that ETag values can often be.

If you take the Last-Modified timestamp from URL A and perform a conditional GET for URL B with an 'If-Modified-Since' of that timestamp, the web server may well give you exactly what you asked for but not what you wanted by saying 'this hasn't been modified since then' even though the contents of those URLs are entirely different. You told the web server to decide purely on the basis of timestamps without reference to anything that might even vaguely specify the content, and so it did. This can happen even if the server is requiring an exact timestamp match (as it probably should), because there are any number of ways for the 'Last-Modified' timestamp of a whole bunch of URLs to be exactly the same because some important common element of them was last updated at that point.

(This is how DWiki works. The Last-Modified date of a page is the most recent timestamp of all of the elements that went into creating it, so if I change some shared element, everything will promptly take on the Last-Modified of that element.)

This means that if you're going to use Last-Modified in conditional GETs, you must handle HTTP redirects specially. It's actively dangerous (to actually getting updates) to mingle Last-Modified dates from the original URL and the redirection URL; you either have to not use Last-Modified at all, or track the Last-Modified values separately. For things that update regularly, any 'missing the current version' problems will cure themselves eventually, but for infrequently updated things you could go quite a while thinking that you have the current content when you don't.

In theory this is also true of ETag values; the specification allows them to be calculated in ways that are URL-specific (the specification mentions that the ETag might be a 'revision number'). A plausible implementation of serving a collection of pages from a Git repository could use the repository's Git revision as the common ETag for all pages; after all, the URL (the page) plus that git revision uniquely identifies it, and it's very cheap to provide under the right circumstances (eg, you can record the checked out git revision).

In practice, common ways of generating ETags will make them different across different URLs, potentially unless the contents are the same. DWiki generates ETag values using a cryptographic hash, so two different URLs will only have the same ETag if they have the same contents, which I believe is a common approach for pages that are generated dynamically. Apache generates ETag values for static files using various file attributes that will be different for different files, which is probably also a common approach for things that serve static files. Pragmatically you're probably much safer sending an ETag value from one URL in an If-None-Match header to another URL (for example, through repeating it while following a HTTP redirection). It's still technically wrong, though, and it may cause problems someday.

(This feels obvious but it was only today that I realized how it interacts with conditional GETs and HTTP redirects.)

If-None-Match values must come from the actual URL itself

Chris's Wiki :: blog

By: cks

24 September 2025 at 16:55

Because I recently looked at the web server logs for Wandering Thoughts, I said something on the Fediverse:

It's impressive how many ways feed readers screw up ETag values. Make up their own? Insert ETags obtained from the target of a HTTP redirect of another request? Stick suffixes on the end? Add their own quoting? I've seen them all.

(And these are just the ones that I can readily detect from the ETag format being wrong for the ETags my techblog generates.)

(Technically these are If-None-Match values, not ETag values; it's just that the I-N-M value is supposed to come from an ETag you returned.)

One of these mistakes deserves special note, and that's the HTTP redirect case. Suppose you request a URL, receive a HTTP 302 temporary redirect, follow the redirect, and get a response at the new URL with an ETag value. As a practical matter, you cannot then present that ETag value in an If-None-Match header when you re-request the original URL, although you could if you re-requested the URL that you were redirected to. The two URLs are not the same and they don't necessarily have the same ETag values or even the same format of ETags.

(This is an especially bad mistake for a feed fetcher to make here, because if you got a HTTP redirect that gives you a different format of ETag, it's because you've been redirected to a static HTML page served directly by Apache (cf) and it's obviously not a valid syndication feed. You shouldn't be saving the ETag value for responses that aren't valid syndication feeds, because you don't want to get them again.)

This means that feed readers can't just store 'an ETag value' for a feed. They need to associate the ETag value with a specific, final URL, which may not be the URL of the feed (because said feed URL may have been redirected). They also need to (only) make conditional requests when they have an ETag for that specific URL, and not copy the If-None-Match header from the initial GET into a redirected GET.

This probably clashes with many low level HTTP client APIs, which I suspect want to hide HTTP redirects from the caller. For feed readers, such high level APIs are a mistake. They actively need to know about HTTP redirects so that, for example, they can consider updating their feed URL if they get permanent HTTP redirects to a new URL. And also, of course, to properly handle conditional GETs.

(4 comments.)

A hack: outsourcing web browser/client checking to another web server

Chris's Wiki :: blog

By: cks

24 September 2025 at 03:18

A while back on the Fediverse, I shared a semi-cursed clever idea:

Today I realized that given the world's simplest OIDC IdP (one user, no password, no prompting, the IdP just 'logs you in' if your browser hits the login URL), you could put @cadey's Anubis in front of anything you can protect with OIDC authentication, including anything at all on an Apache server (via mod_auth_openidc). No need to put Anubis 'in front' of anything (convenient for eg static files or CGIs), and Anubis doesn't even have to be on the same website or machine.

This can be generalized, of course. There are any number of filtering proxies and filtering proxy services out there that will do various things for you, either for free or on commercial terms; one example of a service is geoblocking that's maintained by someone else who's paid to be on top of it and be accurate. Especially with services, you may not want to put them in front of your main website (that gives the service a lot of power), but you would be fine with putting a single-purpose website behind the service or the proxy, if your main website can use the result. With the world's simplest OIDC IdP, you can do that, at least for anything that will do OIDC.

(To be explicit, yes, I'm partly talking about Cloudflare.)

This also generalizes in the other direction, in that you don't necessarily need to use OIDC. You just need some system for passing authenticated information back and forth between your main website and your filtered, checked, proxied verification website. Since you don't need to carry user identity information around this can be pretty simple (although it's going to involve some cryptography, so I recommend just using OIDC or some well-proven option if you can). I've thought about this a bit and I'm pretty certain you can make a quite simple implementation.

(You can also use SAML if you happen to have an extremely simple SAML server and appropriate SAML clients, but really, why. OIDC is today's all-purpose authentication hammer.)

A custom system can pass arbitrary information back and forth between the main website and the verifier, so you can know (for example) if the two saw the same client details. I think you can do this to some extent with OIDC as well if you have a custom IdP, because nothing stops your IdP and your OIDC client from agreeing on some very custom OIDC claims, such as (say) 'clientip'.

(I don't know of any such minimal OIDC server, although I wouldn't be surprised if one exists, probably as a demonstration or test server. And I suppose you can always put a banner on your OIDC IdP's login page that tells people what login and password to use, if you can only find a simple IdP that requires an actual login.)

(2 comments.)

Why Firefox's media autoplay settings are complicated and imperfect

Chris's Wiki :: blog

By: cks

8 September 2025 at 03:25

In theory, a website that wanted to play video or audio could throw in a '<video controls ...>' or '<audio controls ...>' element in the HTML of the page and be done with it. This would make handling playing media simple and blocking autoplay reliable; you'd ignore the autoplay element and the person using the browser would directly trigger playing media by interacting with things that the browser directly controlled and so the browser could know for sure that a person had directly clicked on them and the media should be played.

As anyone who's seen websites with audio and video on the web knows, in practice almost no one does it this way, with browser controls on the <video> or <audio> element. Instead, everyone displays controls of their own somehow (eg as HTML elements styled through CSS), attaches JavaScript actions to them, and then uses the HTMLMediaElement browser API to trigger playback and various other things. As a result of this use of JavaScript, browsers in general and Firefox in particular no longer have a clear, unambiguous view of your intentions to play media. At best, all they can know is that you interacted with the web page, this interaction triggered some JavaScript, and the JavaScript requested that media play.

(Browsers can know somewhat of how you interacted with a web page, such as whether you clicked or scrolled or typed a key.)

On good, well behaved websites, this interaction is with visually clear controls (such as a visual 'play' button) and the JavaScript that requests media playing is directly attached to those controls. And even on these websites, JavaScript may later legitimately act asynchronously to request more playing of things, or you may interact with media playback in other ways (such as spacebar to pause and then restart media playing). On not so good websites, well, any piece of JavaScript that manages to run can call HTMLMediaElement.play() to try to start playing the media. There are lots of ways to have JavaScript run automatically and so a web page can start trying to play media the moment its JavaScript starts running, and it can keep trying to trigger playback over and over again if it wants to through timers or suchlike.

Since Firefox only blocking the actual autoplay attribute and allowing JavaScript to trigger media playing any time it wants to would be a pretty obviously bad 'Block Autoplay' experience, it must try harder. Firefox's approach is to (also) block use of HTMLMediaElement.play() until you have done some 'user gesture' on the page. As far as I can tell from Firefox's description of this, the list of 'user gestures' is fairly expansive and covers much of how you interact with a page. Certainly, if a website can cause you to click on something, regardless of what it looks like, this counts as a 'user gesture' in Firefox.

(I'm sure that Firefox's selection of things that count as 'user gestures' are drawn from real people on real hardware doing things to deliberately trigger playback, including resuming playback after it's been paused by, for example, tapping spacebar.)

In Firefox, this makes it quite hard to actually stop a bad website from playing media while preserving your ability to interact with the site. Did you scroll the page with the spacebar? I think that counts as a user gesture. Did you use your mouse scroll wheel? Probably a user gesture. Did you click on anything at all, including to dismiss some banner? Definitely a user gesture. As far as I can tell, the only reliable way you can prevent a web page from starting media playback is to immediately close the page. Basically anything you do to use it is dangerous.

Firefox does have a very strict global 'no autoplay' policy that you can turn on through about:config, which they call click-to-play, where Firefox tries to limit HTMLMediaElement.play() to being called as the direct result of a JavaScript event handler. However, their wiki notes that this can break some (legitimate) websites entirely (well, for media playback), and it's a global setting that gets in the way of some things I want; you can't set it only for some sites. And even with click-to-play, if a website can get you to click on something of its choice, it's game over as far as I know; if you have to click or tap a key to dismiss an on-page popup banner, the page can trigger media playing from that event handler.

All of this is why I'd like a per-website "permanent mute" option for Firefox. As far as I know, there's literally no other way in standard Firefox to reliably prevent a potentially bad website (or advertising network that it uses) from playing media on you.

(I suspect that you can defeat a lot of such websites with click-to-play, though.)

PS: Muting a tab in Firefox is different from stopping media playback (or blocking it from starting). All it does is stop Firefox from outputting audio from that tab (to wherever you're having Firefox send audio). Any media will 'play' or continue to play, including videos displaying moving things and being distracting.

(3 comments.)

HTTP headers that tell syndication feed fetchers how soon to come back

Chris's Wiki :: blog

By: cks

4 September 2025 at 03:17

Programs that fetch syndication feeds should fetch them only every so often. But how often? There are a variety of ways to communicate this, and for my own purposes I want to gather them in one place.

I'll put the summary up front. For Atom syndication feeds, your HTTP feed responses should contain a Cache-Control: max-age=... HTTP header that gives your desired retry interval (in seconds), such as '3600' for pulling the feed once an hour. If and when people trip your rate limits and get HTTP 429 responses, your 429s should include a Retry-After header with how long you want feed readers to wait (although they won't).

There are two syndication feed formats in general usage, Atom and RSS2. Although generally not great (and to be avoided), RSS2 format feeds can optionally contain a number of elements to explicitly tell feed readers how frequently they should poll the feed. The Atom syndication feed format has no standard element to communicate polling frequency. Instead, the nominally standard way to do this is through a general Cache-Control: max-age=... HTTP header, which gives a (remaining) lifetime in seconds. You can also set an Expires header, which gives an absolute expiry time, but not both.

(This information comes from Daniel Aleksandersen's Best practices for syndication feed caching. One advantage of HTTP headers over feed elements is that they can be returned on HTTP 304 Not Modified responses; one drawback is that you need to be able to set HTTP headers.)

If you have different rate limit policies for conditional GET requests and unconditional ones, you have a choice to make about the time period you advertise on successful unconditional GETs of your feed. Every feed reader has to do an unconditional GET the first time it fetches your feed, and many of them will periodically do unconditional GETs for various reasons. You could choose to be optimistic, assume that the feed reader's next poll will be a conditional GET, and give it the conditional GET retry interval, or you could be pessimistic and give it a longer unconditional GET one. My personal approach is to always advertise the conditional GET retry interval, because I assume that if you're not going to do any conditional GETs you're probably not paying attention to my Cache-Control header either.

As rachelbythebay's ongoing work on improving feed reader behavior has uncovered, a number of feed readers will come back a bit earlier than your advertised retry interval. So my view is that if you have a rate limit, you should advertise a retry interval that is larger than it. On Wandering Thoughts my current conditional GET feed rate limit is 45 minutes, but I advertise a one hour max-age (and I would like people to stick to once an hour).

(Unconditional GETs of my feeds are rate limited down to once every four hours.)

Once people trip your rate limits and start getting HTTP 429 responses, you theoretically can signal how soon they can come back with a Retry-After header. The simplest way to implement this is to have a constant value that you put in this header, even if your actual rate limit implementation would allow a successful request earlier. For example, if you rate limit to one feed fetch every half hour and a feed fetcher polls after 20 minutes, the simple Retry-After value is '1800' (half an hour in seconds), although if they tried again in just over ten minutes they could succeed (depending on how you implement rate limits). This is what I currently do, with a different Retry-After (and a different rate limit) for conditional GET requests and unconditional GETs.

My suspicion is that there are almost no feed fetchers that ignore your Cache-Control max-age setting but that honor your HTTP 429 Retry-After setting (or that react to 429s at all). Certainly I see a lot of feed fetchers here behaving in ways that very strongly suggest they ignore both, such as rather frequent fetch attempts. But at least I tried.

Sidebar: rate limit policies and feed reader behavior

When you have a rate limit, one question is whether failed (rate limited) requests should count against the rate limit, or if only successful ones count. If you nominally allow one feed fetch every 30 minutes and a feed reader fetches at T (successfully), T+20, and T+33, this is the difference between the third fetch failing (since it's less than 30 minutes from the previous attempt) or succeeding (since it's more than 30 minutes from the last successful fetch).

There are various situations where the right answer is that your rate limit counts from the last request even if the last request failed (what Exim calls a strict ratelimit). However, based on observed feed reader behavior, doing this strict rate limiting on feed fetches will result in quite a number of syndication feed readers never successfully fetching your feed, because they will never slow down and drop under your rate limit. You probably don't want this.

(One comment.)

Mapping from total requests per day to average request rates

Chris's Wiki :: blog

By: cks

3 September 2025 at 03:43

Suppose, not hypothetically, that a single IP address with a single User-Agent has made 557 requests for your blog's syndication feed in about 22 and a half hours (most of which were rate-limited and got HTTP 429 replies). If we generously assume that these requests were distributed evenly over one day (24 hours), what was the average interval between requests (the rate of requests)? The answer is easy enough to work out and it's about two and a half minutes between requests, if they were evenly distributed.

I've been looking at numbers like this lately and I don't feel like working out the math each time, so here is a table of them for my own future use.

Total requests	Theoretical interval (rate)
6	Four hours
12	Two hours
24	One hour
32	45 minutes
48	30 minutes
96	15 minutes
144	10 minutes
288	5 minutes
360	4 minutes
480	3 minutes
720	2 minutes
1440	One minute
2880	30 seconds
5760	15 seconds
8640	10 seconds
17280	5 seconds
43200	2 seconds
86400	One second

(This obviously isn't comprehensive; instead I want it to give me a ballpark idea, and I care more about higher request counts than lower ones. But not too high because I mostly don't deal with really high rates. Every four hours and every 45 minutes are relevant to some ratelimiting I do.)

Yesterday there were about 20,240 requests for the main syndication feed for Wandering Thoughts, which is an aggregate rate of more than one request every five seconds. About 10,570 of those requests weren't blocked in various ways or ratelimited, which is still more than one request every ten seconds (if they were evenly spread out, which they probably weren't).

(There were about 48,000 total requests to Wandering Thoughts, and about 18,980 got successful responses, although almost 2,000 of those successful responses were a single rogue crawler that's now blocked. This is of course nothing compared to what a busy website sees. Yesterday my department's web server saw 491,900 requests, although that seems to have been unusually high. Interested parties can make their own tables for that sort of volume level.)

It's a bit interesting to see this table written out this way. For example, if I thought about it I knew there was a factor of ten difference between one request every ten seconds and one request every second, but it's more concrete when I see the numbers there with the extra zero.

I wish Firefox had some way to permanently mute a website

Chris's Wiki :: blog

By: cks

31 August 2025 at 02:27

Over on the Fediverse, I had a wish:

My kingdom for a way to tell Firefox to never, ever play audio and/or video for a particular site. In other words, a permanent and persistent mute of that site. AFAIK this is currently impossible.

(For reasons, I cannot set media.autoplay.blocking_policy to 2 generally. I could if Firefox had a 'all subdomains of ...' autoplay permission, but it doesn't, again AFAIK.)

(This is in a Firefox setup that doesn't have uMatrix and that runs JavaScript.)

Sometimes I visit sites in my 'just make things work' Firefox instance that has JavaScript and cookies and so on allowed (and throws everything away when it shuts down), and it turns out that those sites have invented exceedingly clever ways to defeat Firefox's default attempts to let you block autoplaying media (and possibly their approach is clever enough to defeat even the strict 'click to start' setting for media.autoplay.blocking_policy). I'd like to frustrate those sites, especially ones that I keep winding up back on for various reasons, and never hear unexpected noises from Firefox.

(In general I'd probably like to invert my wish, so that Firefox never played audio or video by default and I had to specifically enable it on a site by site basis. But again this would need an 'all subdomains of' option. This version might turn out to be too strict, I'd have to experiment.)

You can mute a tab, but only once it starts playing, and your mute isn't persistent. As far as I know there's no (native) way to get Firefox to start a tab muted, or especially to always start tabs for a site in a muted state, or to disable audio and/or video for a site entirely (the way you can deny permission for camera or microphone access). I'm somewhat surprised that Firefox doesn't have any option for 'this site is obnoxious, put them on permanent mute', because there are such sites out there.

Both uMatrix and apparently NoScript can selectively block media, but I'd have to add either of them to this profile and I broadly want it to be as plain as reasonable. I do have uBlock Origin in this profile (because I have it in everything), but as far as I can tell it doesn't have a specific (and selective) media blocking option, although it's possible you can do clever things with filter rules, especially if you care about one site instead of all sites.

(I also think that Firefox should be able to do this natively, but evidently Firefox disagrees with me.)

PS: If Firefox actually does have an apparently well hidden feature for this, I'd love to know about it.

(4 comments.)

Why Wandering Thoughts has fewer comment syndication feeds than yesterday

Chris's Wiki :: blog

By: cks

27 August 2025 at 03:05

Over on the Fediverse I said:

My techblog used to offer Atom syndication feeds for the comments on individual entries. I just turned that off because it turns out to be a bad idea on the modern web when you have many years of entries. There are (were) any number of 'people' (feed things) that added the comment feeds for various entries years ago and then never took them out, despite those entries being years old and in some cases never having gotten comments in the first place.

DWiki, the engine behind Wandering Thoughts, is nothing if not general. Syndication feeds, for example, are a type of 'view' over a directory hierarchy, and are available for both pages and comments. A regular (page) syndication feed view can only be done over (on) a directory, because if it was applied to an individual page the feed would only ever contain that page. However, when I wrote DWiki it was obvious that a comment syndication feed for a particular page made sense; it would give you all of the comments 'under' that page (ie, on it). And so for almost all of the time that Wandering Thoughts has been in operation, you could have looked down to the bottom of an entry's page (on the web) and seen in small type 'Atom Syndication: Recent Comments' (with the 'recent comments' being a HTML link giving you the URL of that page's comment feed).

(The comment syndication feed for a directory is all comments on all pages underneath the directory.)

That's gone now, because I decided that it didn't make sense in what Wandering Thoughts has become and because I was slowly accumulating feed readers that were pulling the comment syndication feeds for more and more entries. This is exactly the behavior I should have expected from feed readers from the start; once someone puts a feed in, that feed is normally forever even if it's extremely inactive or has never had an entry. The feed reader will dutifully poll every feed for years to come (well, certainly every feed that responds with HTTP success and a valid syndication feed, which all of my comment feeds did).

(There weren't very many pages having their comment syndication feeds hit, but there were enough that I kept noticing them, especially when I added things like hacky rate limiting for feed fetching. I actually put in some extra hacks to deal with how requests for these feeds interacted with my rate limiting.)

There are undoubtedly places on the Internet where discussion (in the form of comments) continues on for years on certain pages, and so a comment feed for an individual page could make sense; you really might keep up (in your feed reader) with a slow moving conversation that lasts years. Other places on the Internet put definite cut-offs on further discussion (comments) on individual pages, which provides a natural deadline to turn off the page's comment syndication feed. But neither of those profiles describes Wandering Thoughts, where my entries remain open for comments more or less forever (and sometimes people do comment on quite old entries), but comments and discussions don't tend to go on for very long.

Of course, the other thing that this change prevents is that it stops (LLM) web crawlers from trying to crawl all of those URLs for comment syndication feeds. You can't crawl URLs that aren't advertised any more and no longer exist (well, sort of, they technically exist but the code for handling them arranges to return 404s if the new 'no comment feeds for actual pages' configuration option is turned on).

(2 comments.)

Websites and web developers mostly don't care about client-side problems

Chris's Wiki :: blog

By: cks

23 August 2025 at 03:30

In response to my entry on the fragility of the web in the face of the crawler plague, Jukka said in a comment:

While I understand the server-side frustrations, I think the corresponding client-side frustrations have largely been lacking from the debates around the Web.

For instance, CloudFlare now imposes heavy-handed checks that take a few seconds to complete. [...]

This is absolutely true but it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (also, and including on desktops). Just recently (as of August 2025), Github put out a major update that many people are finding immensely slow even on developer desktops. If we can't get web developers to care about common or majority experiences for their UI, which in some sense has relatively little on the line, the odds of web site operators caring when their servers are actually experiencing problems (or at least annoyances) is basically nil.

Much like browsers have most of the power in various relationships with, for example, TLS certificate authorities, websites have most of the power in their relationship to clients (ie, us). If people don't like what a website is doing, their only option is generally a boycott. Based on the available evidence so far, any boycotts over things like CAPTCHA challenges have been ineffective so far. Github can afford to give people a UI with terrible performance because the switching costs are sufficiently high that they know most people won't.

(Another view is that the server side mostly doesn't notice or know that they're losing people; the lost people are usually invisible, with websites only having much visibility into the people who stick around. I suspect that relatively few websites do serious measurement of how many people bounce off or stop using them.)

Thus, in my view, it's not so much that client-side frustrations have been 'lacking' from debates around the web, which makes it sound like client side people haven't been speaking up, as that they've been actively ignored because, roughly speaking, no one on the server side cares about client-side frustrations. Maybe they vaguely sympathize, but they care a lot more about other things. And it's the web server side who decides how things operate.

(The fragility exposed by LLM crawler behavior demonstrates that clients matter in one sense, but it's not a sense that encourages website operators to cooperate or listen. Rather the reverse.)

I'm in no position to throw stones here, since I'm actively making editorial decisions that I know will probably hurt some real clients. Wandering Thoughts has never been hammered by crawler load the way some sites have been; I merely decided that I was irritated enough by the crawlers that I was willing to throw a certain amount of baby out with the bathwater.

(One comment.)

The current (2025) crawler plague and the fragility of the web

Chris's Wiki :: blog

By: cks

21 August 2025 at 03:33

These days, more and more people are putting more and more obstacles in the way of the plague of crawlers (many of them apparently doing it for LLM 'AI' purposes), me included. Some of these obstacles involve attempting to fingerprint unusual aspects of crawler requests, such as using old browser User-Agents or refusing to accept compressed things in an attempt to avoid gzip bombs; other obstacles may involve forcing visitors to run JavaScript, using CAPTCHAs, or relying on companies like Cloudflare to block bots with various techniques.

On the one hand, I sort of agree that these 'bot' (crawler) defenses are harmful to the overall ecology of the web. On the other hand, people are going to do whatever works for them for now, and none of the current alternatives are particularly good. There's a future where much of the web simply isn't publicly available any more, at least not to anonymous people.

One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.

(This isn't new fragility; the fragility was always there.)

Unfortunately, I don't see a technical way out from this (and I'm not sure I see any realistic way in general). There's no magic wand that we can wave to make all of the existing websites, web apps, and so on not get impaired by LLM crawlers when the crawlers persist in visiting everything despite being told not to, and on top of that we're not going to make bandwidth free. Instead I think we're looking at a future where the web ossifies for and against some things, and more and more people see catgirls.

(I feel only slightly sad about my small part in ossifying some bits of the web stack. Another part of me feels that a lot of web client software has gotten away with being at best rather careless for far too long, and now the consequences are coming home to roost.)

(8 comments.)

How not to check or poll URLs, as illustrated by Fediverse software

Chris's Wiki :: blog

By: cks

18 August 2025 at 02:43

Over on the Fediverse, I said some things:

[on April 27th:]
A bit of me would like to know why the Akkoma Fediverse software is insistently polling the same URL with HEAD then GET requests at five minute intervals for days on end. But I will probably be frustrated if I turn over that rock and applying HTTP blocks to individual offenders is easier.

(I haven't yet blocked Akkoma in general, but that may change.)

[the other day:]
My patience with the Akkoma Fediverse server software ran out so now all attempts by an Akkoma instance to pull things from my techblog will fail (with a HTTP redirect to a static page that explains that Akkoma mis-behaves by repeatedly fetching URLs with HEAD+GET every few minutes). Better luck in some future version, maybe, although I doubt the authors of Akkoma care about this.

(The HEAD and GET requests are literally back to back, with no delay between them that I've ever observed.)

Akkoma is derived from Pleroma and I've unsurprisingly seen Pleroma also do the HEAD then GET thing, but so far I haven't seen any Pleroma server showing up with the kind of speed and frequency that (some) Akkoma servers do.

These repeated HEADs and GETs are for Wandering Thoughts entries that haven't changed. DWiki is carefully written to supply valid HTTP Last-Modified and ETag, and these values are supplied in replies to both HEAD and GET requests. Despite all of this, Akkoma is not doing conditional GETs and is not using the information from the HEAD to avoid doing a GET if neither header has changed its value from the last time. Since Akkoma is apparently completely ignoring the result of its HEAD request, it might as well not make the HEAD request in the first place.

If you're going to repeatedly poll a URL, especially every five or ten minutes, and you want me to accept your software, you must do conditional GETs. I won't like you and may still arrange to give you HTTP 429s for polling so fast, but I most likely won't block you outright. Polling every five or ten minutes without conditional GET is completely unacceptable, at least to me (other people probably don't notice or care).

My best guess as to why Akkoma is polling the URL at all is that it's for "link previews". If you link to something in a Fediverse post, various Fediverse software will do the common social media thing of trying to embed some information about the target of the URL into the post as it presents it to local people; for plain links with no special handling, this will often show the page title. As far as the (rapid) polling goes, I can only guess that Akkoma has decided that it is extremely extra special and it must update its link preview information very rapidly should the linked URL do something like change the page title. However, other Fediverse server implementations manage to do link previews without repeatedly polling me (much less the HEAD then immediately a GET thing).

(On the global scale of things this amount of traffic is small beans, but it's my DWiki and I get to be irritated with bad behavior if I want to, even if it's small scale bad behavior.)

(One comment.)

Introducing the illumos Cafe: Another Cozy Corner for OS Diversity

IT+Notes

By: Stefano Marinelli

18 August 2025 at 07:04

Introducing illumos Cafe: a community-run hub on illumos, inspired by BSD Cafe. Fediverse-ready (Mastodon, snac). Built for OS diversity, transparency, and positivity.

Typepad Is Shutting Down Next Month

Pixel Envy

By: Nick Heer

28 August 2025 at 02:51

Typepad:

After September 30, 2025, access to Typepad – including account management, blogs, and all associated content – will no longer be available. Your account and all related services will be permanently deactivated.

I have not thought about Typepad in years, and I am certain I am not alone. That is not a condemnation; Typepad occupies a particular time and place on the web. As with anything hosted, however, users are unfortunately dependent on someone else’s interest in maintaining it.

If you have anything hosted at Typepad, now is a good time to back it up.

⌥ Permalink

Interview With MacSurfer’s New Owner, Ken Turner

Pixel Envy

By: Nick Heer

16 August 2025 at 20:56

Nice scoop from Eric Schwarz:

Over the past week, I’ve been working to track down the new owner of MacSurfer’s Headline News, a beloved site that shut down in 2020 and has recently had somewhat mysterious revival. Fortunately, after some digging that didn’t really lead anywhere, I received an email from its new owner, Ken Turner, and he graciously took the time to answer a few questions about the new project.

Turner sounds like a great steward to carry on the MacSurfer legacy. Even in an era of well-known aggregators like Techmeme and massive forums like Hacker News and Reddit, I think there is still a role for a smaller and more focused media tracking site.

I am uncertain what the role of BackBeat Media is in all this. I have not heard from Dave Hamilton or anyone there to confirm if they even have a role.

⌥ Permalink

MacSurfer Returns

Pixel Envy

By: Nick Heer

14 August 2025 at 17:39

Five years ago, Apple and tech news aggregator MacSurfer announced it was shutting down. The site was still accessible albeit in a stopped-time state, and it seemed that is how it would sit until the server died.

In June, though, MacSurfer was relaunched. The design has been updated and it is no longer as technically simple as it once was, but — charmingly — the logo appears to be the exact same static GIF as always. I cannot find any official announcement of its return.

Eric Schwarz:

It looks like Macsurfer is coming back, but I can’t find any details or who’s behind it? I really hope it’s not AI slop or someone trying to make a buck off nostalgia like iLounge or TUAW.

I had the same question, so I started digging. MxToolbox reveals a txt record on the domain for validating with Google apps, registered to BackBeat Media. BackBeat’s other properties include the Mac Observer, AppleInsider, and PowerPage. A review of historical MacSurfer txt records using SecurityTrails indicates the site has been with Backbeat Media since at least 2011, even though BackBeat’s site has not listed MacSurfer even when it was actively updated.

I cannot confirm the ownership is the same yet but I have asked Dave Hamilton, of BackBeat, and will update this if I hear back.

⌥ Permalink

New Article on BSD Cafe Journal: WordPress on FreeBSD with BastilleBSD

IT+Notes

By: Stefano Marinelli

21 July 2025 at 07:30

A new article on running WordPress on FreeBSD with BastilleBSD has been published on the BSD Cafe Journal, plus a small update on future technical content.

A logic to Apache accepting query parameters for static files

Chris's Wiki :: blog

By: cks

19 July 2025 at 03:35

One of my little web twitches is the lax handling of unknown query parameters. As part of this twitch I've long been a bit irritated that Apache accepts query parameters even on static files, when they definitely have no meaning at all. You could say that this is merely Apache being accepting in general, but recently I noticed a combination of Apache features that can provide an additional reason for Apache to do this.

Apache has various features to redirect from old URLs on your site to new URLs, such as Redirect and RewriteRule. As covered in the relevant documentation for each of them, these rewrites preserve query parameters (although for RewriteRule you can turn that off with the QSD flag). This behavior makes sense in a lot of cases; if you've moved an application from one URL to another (or from one host to another) and it uses query parameters, you almost certainly want the query parameters to carry over with the HTTP redirection that people using old URLs will get.

(Here by 'an application' I mean anything that accepts and acts on query parameters. It might be a CGI, a PHP page or set of pages, a reverse proxy to something else, a Django application implemented with mod_wsgi, or various other things.)

A lot of the time if you use a redirect in Apache on URLs for an application, you'll be sending people to the new location of that application or its replacement. However, some of the time you'll be redirecting from an application to a static page, for example a page that says "this application has gone away". At least by default, your redirection from the application to the static page will carry query parameters along with it, and it would be a bad experience (for the people visiting and you) if the default result was that Apache served some sort of error page because it received query parameters on a static file.

(A closely related change is replacing a single-URL application, such as a basic CGI, with a static web page. Maybe the whole thing is no longer supported, or maybe everything now has a single useful response regardless of query parameters. Here again you can legitimately receive query parameters on a static file.)

Realizing this made me more sympathetic to Apache's behavior of accepting query parameters on static files. It's a relatively reasonable pragmatic choice even if (like me) you're not one of the people who feel unknown query parameters should always be ignored (which is the de facto requirement on the modern web, so my feelings about it are irrelevant).

(2 comments.)

Two tools I've been using to look into my web traffic volume

Chris's Wiki :: blog

By: cks

17 July 2025 at 03:09

These days, there's an unusually large plague of web crawlers, many of them attributed to LLM activities and most of them acting anonymously, with forged user agents and sometimes widely distributed source IPs. Recently I've been using two tools more and more to try to identify and assess suspicious traffic sources.

The first tool is Anarcat's asncounter. Asncounter takes IP addresses, for example from your web server logs, and maps them to ASNs (roughly who owns an IP address) and to CIDR netblocks that belong to those ASNs (a single ASN can have a lot of netblocks). This gives you information like:

count   percent ASN     AS
1460    7.55    24940   HETZNER-AS, DE
[...]
count   percent prefix  ASN     AS
1095    5.66    66.249.64.0/20  15169   GOOGLE, US
[...]
85      0.44    49.13.0.0/16    24940   HETZNER-AS, DE
85      0.44    65.21.0.0/16    24940   HETZNER-AS, DE
82      0.42    138.201.0.0/16  24940   HETZNER-AS, DE
71      0.37    135.181.0.0/16  24940   HETZNER-AS, DE
68      0.35    65.108.0.0/16   24940   HETZNER-AS, DE
[...]

While Hetzner is my biggest traffic source by ASN, it's not my biggest source by 'prefix' (a CIDR netblock), because this Hetzner traffic is split up across a bunch of their networks. Since most software operates by CIDR netblocks, not by ASNs, this difference can be important (and unfortunate if you want to block all traffic from a particular ASN).

The second tool is grepcidr. Grepcidr will let you search through a log file, such as your web server logs, for traffic from any particular netblock (or a group of netblocks), such as Google's '66.249.64.0/20'. This lets me find out what sort of requests came from a potentially suspicious network block, for example 'grepcidr 49.13.0.0/16 /var/log/...'. If what I see looks suspicious and has little or no legitimate traffic, I can consider taking steps against that netblock.

Asncounter is probably not (yet) packaged in your Linux distribution. Grepcidr may be, but if it's not it's a C program and simple to compile.

(It wouldn't be too hard to put together an 'asngrep' that would cut out the middleman, but I've so far not attempted to do this.)

PS: Both asncounter and grepcidr can be applied to other sorts of logs with IP addresses, for example sources of SSH brute force password scans. But my web logs are all that I've used them for so far.

(3 comments.)

Doing web things with CGIs is mostly no longer a good idea

Chris's Wiki :: blog

By: cks

15 July 2025 at 02:37

Recently I saw Serving 200 million requests per day with a cgi-bin (via, and there's a follow-up), which talks about how fast modern CGIs can be in compiled languages like Rust and Go (Rust more so than Go, because Go has a runtime that it has to start every time a Go program is executed). I'm a long standing fan of CGIs (and Wandering Thoughts, this blog, runs as a CGI some of the time), but while I admire these articles, I think that you mostly shouldn't consider trying to actually write a CGI these days.

Where and how CGI programs shine is when they have a simple deployment and development model. You write a little program, you put the little program somewhere, and it just works (and it's not going to be particularly slow these days). The programs run only when they get used, and if you're using Apache, you can also make these little programs run as the user who owns that web area instead of the web server user.

Where CGI programs fall down today is that they're unpopular, no longer well supported in various programming environments and frameworks, and they don't integrate with various other tools because these days the tools expect to operate as HTTP (reverse) proxies in front of your HTTP service (for example, Anubis for anti-crawler protections). It's easy to write, for example, a Go HTTP based web service; you can find lots of examples of how to do it (and the pieces are part of Go's standard library). If you want to write a Go CGI, you're actually in luck because Go put that in the standard library, but you're not going to find anywhere near as many examples and of course you won't get that integration with other HTTP reverse proxy tools. Other languages are not necessarily going to be as friendly as Go (including Python, which has removed the 'cgi' standard library package in 3.13).

(Similarly, many modern web servers are less friendly to CGIs than Apache is and will make you assemble more pieces to run them, reducing a number of the deployment advantages of CGIs.)

Only running these 'backend' HTTP server programs when they're needed is not easy today (although it's possible with systemd), so if you have a lot of little things that you can't bundle together into one server program, CGIs may still make sense despite what is generally the extra hassle of developing and running them. But otherwise, a HTTP based service that you run behind your general purpose web server is what modern web development is steering you toward and it's almost certainly going to be the easiest path.

(There's also a lot of large scale software support for deploying things that are HTTP services, with things like load balancers and smart routing frontends and so on and so forth, never mind containers and orchestration environments. If you want to use CGIs in this environment you basically get to add in a little web server as the way the outside world invokes them.)

(2 comments.)

Quick numbers on how common HTTP/2 is on our departmental web server

Chris's Wiki :: blog

By: cks

21 June 2025 at 02:22

Our general purpose departmental web server has supported HTTP/2 for a while. When we added HTTP/2 support it was basically because it was there; HTTP/2 was the new and shiny thing, our Apache configuration could support it, and so it seemed like a friendly gesture to turn HTTP/2 on. Until now, I've never looked at the statistics for how many HTTP requests use HTTP/2 and how many use other HTTP versions.

Our general purpose web server supports both HTTP access and HTTPS access, unless people opt to forcefully redirect their own pages from one to the other (we have plenty of old pages with mixed content problems, so we can't do such a redirection globally). However, these days that may not be much of an issue and browsers may force HTTPS on the initial connection, which will succeed with our server. I mention all of this because unfortunately our logs don't let me see how many requests are HTTP versus HTTPS. In some environments I could assume that all HTTP/2.0 requests were HTTPS, but the standard Ubuntu Apache HTTP/2 configuration enables h2c so I believe we can do HTTP/2.0 over HTTP connections without any sign of this in our current logs.

The overall number is that about 55% of the requests are HTTP/2.0 and all but a tiny trace of the remaining 45% are HTTP/1.1. However, this isn't uniform. For instance, we've somehow become a load bearing source of commonly used ML training data, and requests for this data are about 70% HTTP/2.0. Meanwhile, a URL hierarchy that maps to our anonymous FTP area sees much less activity, probably much of it from automated crawlers, and only 21% of the requests were HTTP/2.0.

If I look at the claimed User-Agents for HTTP/1.1 requests, some things jump out. A lot of requests come from 'pytorch/vision', along with 'GoogleOther', GPTBot, something claiming to be Chrome 83, PetalBot, Applebot, no User-Agent at all, Scrapy, and a whole menagerie of other crawlers. Actual probably authentic browser user agent values are mostly absent, which isn't a really big surprise since I think browsers aggressively do HTTP/2.0 these days.

(A lot of those 'pytorch/vision' requests were for that commonly used ML training data, but they seem to have been dwarfed by the HTTP/2.0 requests from browsers.)

Given even this cursory log analysis, I suspect that for our web server, HTTP/1.1 requests are significantly correlated with access from non-browsers, including crawlers (both overt and covert). Again this isn't really a surprise if modern browsers are trying to use HTTP/2 as much as possible, since most people are running modern browsers (especially Chrome).

What would a multi-user web server look like? (A thought experiment)

Chris's Wiki :: blog

By: cks

13 June 2025 at 02:51

Every so often my thoughts turn to absurd ideas. Today's absurd idea is sparked by my silly systemd wish for moving processes between systemd units, which in turn was sparked by a local issue with Apache CGIs (and suexec). This got me thinking about what a modern 'multi-user' web server would look like, where by multi-user I mean a web server that's intended to serve content operated by many different people (such as many different people's CGIs). Today you can sort of do this for CGIs through Apache suexec, but as noted this has limits.

The obvious way to implement this would be to run a web server process for every different person's web area and then reverse proxy to the appropriate process. Since there might be a lot of people and not all of them are visited very often, you would want these web server processes to be started on demand and then shut down automatically after a period of inactivity, rather than running all of the time (on Linux you could sort of put this together with systemd socket units). These web server processes would run as appropriate Unix UIDs, not as the web server UID, and on Linux under appropriate systemd hierarchies with appropriate limits set.

(Starting web server units through systemd would also mean that your main web server process didn't have to be privileged or have a privileged helper, as Apache does with suexec. You could have the front end web server do the process starting and supervision itself, but then it would also need the privileges to change UIDs and the support for setting other per-user context information, some of which is system dependent.)

Although I'm not entirely fond of it, the simplest way to communicate between the main web server and the per-person web server would be through HTTP. Since HTTP reverse proxies are widely supported, this would also allow people to choose what program they'd use as their 'web server', rather than your default. However, you'd want to provide a default simple web server to handle static files, CGIs, and maybe PHP (which would be even simpler than my idea of a modern simple web server).

The main (or front-end) web server would still want to have a bunch of features like global rate limiting, since it's the only thing in a position to see aggregate requests across everyone's individual server. If you wanted to make life more complicated but also potentially more convenient, you could chose different protocols to handle different people's areas. One person could be handled via a HTTP reverse proxy, but another person might be handled through FastCGI because they purely use PHP and that's most convenient for them (provided that their FastCGI server could handle being started on demand and then stopping later).

While I started thinking of this in the context of personal home pages and personal CGIs, as we support on our main web server, you could also use this for having different people and groups manage different parts of your URL hierarchy, or even different virtual hosts (by making the URL hierarchy of the virtual host that was handed to someone be '(almost) everything').

With a certain amount of work you could probably build this today on Linux with systemd (Unix) socket activation, although I don't know what front-end or back-end web server you'd want to use. To me, it feels like there's a certain elegance to the 'everyone gets their own web server running under their own UID, go wild' aspect of this, rather than having to try to make one web server running as one UID do everything.

(One comment.)

Make Your Own Internet Presence with NetBSD and a 1 euro VPS – Part 1: Your Blog

IT+Notes

By: Stefano Marinelli

22 April 2025 at 05:30

Create a personal blog with NetBSD on a 1€ VPS - efficient, secure, and completely under your control. Minimal cost, maximum performance. Because your thoughts don't need to float in someone else's cloud.

My blocking of some crawlers is an editorial decision unrelated to crawl volume

Chris's Wiki :: blog

By: cks

30 May 2025 at 02:33

Recently I read a lobste.rs comment on one of my recent entries that said, in part:

Repeat after me everyone: the problem with these scrapers is not that they scrape for LLM’s, it’s that they are ill-mannered to the point of being abusive. LLM’s have nothing to do with it.

This may be some people's view but it is not mine. For me, blocking web scrapers here on Wandering Thoughts is partly an editorial decision of whether I want any of my resources or my writing to be fed into whatever they're doing. I will certainly block scrapers for doing what I consider an abusive level of crawling, and in practice most of the scrapers that I block come to my attention due to their volume, but I will block low-volume scrapers because I simply don't like what they're doing it for.

Are you a 'brand intelligence' firm that scrapes the web and sells your services to brands and advertisers? Blocked. In general, do you charge for access to whatever you're generating from scraping me? Probably blocked. Are you building a free search site for a cause (and with a point of view) that I don't particularly like? Almost certainly blocked. All of this is an editorial decision on my part on what I want to be even vaguely associated with and what I don't, not a technical decision based on the scraping's effects on my site.

I am not going to even bother trying to 'justify' this decision. It's a decision that needs no justification to some and to others, it's one that can never be justified. My view is that ethics matter. Technology and our decisions of what to do with technology are not politically neutral. We can make choices, and passively not doing anything is a choice too.

(I could say a lot of things here, probably badly, but ethics and politics are in part about what sort of a society we want, and there's no such thing as a neutral stance on that. See also.)

I would block LLM scrapers regardless of how polite they are. The only difference them being politer would make is that I would be less likely to notice (and then block) them. I'm probably not alone in this view.

(One comment.)

A thought on JavaScript "proof of work" anti-scraper systems

Chris's Wiki :: blog

By: cks

26 May 2025 at 02:50

One of the things that people are increasingly using these days to deal with the issue of aggressive LLM and other web scrapers is JavaScript based "proof of work" systems, where your web server requires visiting clients to run some JavaScript to solve a challenge; one such system (increasingly widely used) is Xe Iaso's Anubis. One of the things that people say about these systems is that LLM scrapers will just start spending the CPU time to run this challenge JavaScript, and LLM scrapers may well have lots of CPU time available through means such as compromised machines. One of my thoughts is that things are not quite as simple for the LLM scrapers as they look.

An LLM scraper is operating in a hostile environment (although its operator may not realize this). In a hostile environment, dealing with JavaScript proof of work systems is not as simple as simply running it, because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. Letting your scraper run JavaScript means that it can also run JavaScript for other purposes, for example for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or simply have you run JavaScript for as long as you'll let it keep going (perhaps because they've recognized you as a LLM scraper and want to waste as much of your CPU as possible).

An LLM scraper can try to recognize a JavaScript proof of work system but this is a losing game. The other parties have every reason to make themselves look like a proof of work system, and the proof of work systems don't necessarily have an interest in being recognized (partly because this might allow LLM scrapers to short-cut their JavaScript with optimized host implementations of the challenges). And as both spammers and cryptocurrency miners have demonstrated, there is no honor among thieves. If LLM scrapers dangle free computation in front of people, someone will spring up to take advantage of it. This leaves LLM scrapers trying to pick a JavaScript runtime limit that doesn't cut them off from too many sites, while sites can try to recognize LLM scrapers and increase their proof of work difficulty if they see a suspect.

(This is probably not an original thought, but it's been floating around my head for a while.)

PS: JavaScript proof of work systems aren't the greatest thing, but they're going to happen unless someone convincingly demonstrates a better alternative.

(One comment.)

What keeps Wandering Thoughts more or less free of comment spam (2025 edition)

Chris's Wiki :: blog

By: cks

24 May 2025 at 02:50

Like everywhere else, Wandering Thoughts (this blog) gets a certain amount of automated comment spam attempts. Over the years I've fiddled around with a variety of anti-spam precautions, although not all of them have worked out over time. It's been a long time since I've written anything about this, because one particular trick has been extremely effective ever since I introduced it.

That one trick is a honeypot text field in my 'write a comment' form. This field is normally hidden by CSS, and in any case the label for the field says not to put anything in it. However, for a very long time now, automated comment spam systems seem to operate by stuffing some text into every (text) form field that they find before they submit the form, which always trips over this. I log the form field's text out of curiosity; sometimes it's garbage and sometimes it's (probably) meaningful for the spam comment that the system is trying to submit.

Obviously this doesn't stop human-submitted spam, which I get a small amount of every so often. In general I don't expect anything I can reasonably do to stop humans who do the work themselves; we've seen this play out in email and I don't have any expectations that I can do better. It also probably wouldn't work if I was using a popular platform that had this as a general standard feature, because then it would be worth the time of the people writing automated comment spam systems to automatically recognize it and work around it.

Making comments on Wandering Thoughts also has an additional small obstacle in the way of automated comment spammers, which is that you must initially preview your comment before you can submit it (although you don't have to submit the comment that you previewed, you can edit it after the first preview). Based on a quick look at my server logs, I don't think this matters to the current automated comment spam systems that try things here, as they only appear to try submitting once. I consider requiring people to preview their comment before posting it to be a good idea in general, especially since Wandering Thoughts uses a custom wiki-syntax and a forced preview gives people some chance of noticing any mistakes.

(I think some amount of people trying to write comments here do miss this requirement and wind up not actually posting their comment in the end. Or maybe they decide not to after writing one version of it; server logs give me only so much information.)

In a world that is increasingly introducing various sorts of aggressive precautions against LLM crawlers, including 'proof of work' challenges, all of this may become increasingly irrelevant. This could go either way; either the automated comment spammers die off as more and more systems have protections that are too aggressive for them to deal with, or the automated systems become increasingly browser-based and sidestep my major precaution because they no longer 'see' the honeypot field.

(One comment.)

Thinking about what you'd want in a modern simple web server

Chris's Wiki :: blog

By: cks

22 May 2025 at 02:14

Over on the Fediverse, I said:

I'm currently thinking about what you'd want in a simple modern web server that made life easy for sites that weren't purely static. I think you want CGI, FastCGI, and HTTP reverse proxying, plus process supervision. Automatic HTTPS of course. Rate limiting support, and who knows what you'd want to make it easier to deal with the LLM crawler problem.

(This is where I imagine a 'stick a third party proxy in the middle' mode of operation.)

What I left out of my Fediverse post is that this would be aimed at small scale sites. Larger, more complex sites can and should invest in the power, performance, and so on of headline choices like Apache, Nginx, and so on. And yes, one obvious candidate in this area is Caddy, but at the same time something that has "more scalable" (than alternatives) as a headline features is not really targeting the same area as I'm thinking of.

This goal of simplicity of operation is why I put "process supervision" into the list of features. In a traditional reverse proxy situation (whether this is FastCGI or HTTP), you manage the reverse proxy process separately from the main webserver, but that requires more work from you. Putting process supervision into the web server has the goal of making all of that more transparent to you. Ideally, in common configurations you wouldn't even really care that there was a separate process handling FastCGI, PHP, or whatever; you could just put things into a directory or add some simple configuration to the web server and restart it, and everything would work. Ideally this would extend to automatically supporting PHP by just putting PHP files somewhere in the directory tree, just like CGI; internally the web server would start a FastCGI process to handle them or something.

(Possibly you'd implement CGI through a FastCGI gateway, but if so this would be more or less pre-configured into the web server and it'd ship with a FastCGI gateway for this (and for PHP).)

This is also the goal for making it easy to stick a third party filtering proxy in the middle of processing requests. Rather than having to explicitly set up two web servers (a frontend and a backend) with an anti-LLM filtering proxy in the middle, you would write some web server configuration bits and then your one web server would split itself into a frontend and a backend with the filtering proxy in the middle. There's no technical reason you can't do this, and even control what's run through the filtering proxy and what's served directly by the front end web server.

This simple web server should probably include support for HTTP Basic Authentication, so that you can easily create access restricted areas within your website. I'm not sure if it should include support for any other sort of authentication, but if it did it would probably be OpenID Connect (OIDC), since that would let you (and other people) authenticate through external identity providers.

It would be nice if the web server included some degree of support for more or less automatic smart in-memory (or on-disk) caching, so that if some popular site linked to your little server, things wouldn't explode (or these days, if a link to your site was shared on the Fediverse and all of the Fediverse servers that it propagated to immediately descended on your server). At the very least there should be enough rate limiting that your little server wouldn't fall over, and perhaps some degree of bandwidth limits you could set so that you wouldn't wake up to discover you had run over your outgoing bandwidth limits and were facing large charges.

I doubt anyone is going to write such a web server, since this isn't likely to be the kind of web server that sets the world on fire, and probably something like Caddy is more or less good enough.

(Doing a good job of writing such a server would also involve a fair amount of research to learn what people want to run at a small scale, how much they know, what sort of server resources they have or want to use, what server side languages they wind up using, what features they need, and so on. I certainly don't know enough about the small scale web today.)

PS: One reason I'm interested in this is that I'd sort of like such a server myself. These days I use Apache and I'm quite familiar with it, but at the same time I know it's a big beast and sometimes it has entirely too many configuration options and special settings and so on.

(6 comments.)

In Apache, using OIDC instead of SAML makes for easier testing

Chris's Wiki :: blog

By: cks

9 May 2025 at 02:56

In my earlier installment, I wrote about my views on the common Apache modules for SAML and OIDC authentication, where I concluded that OpenIDC was generally easier to use than Mellon (for SAML). Recently I came up with another reason to prefer OIDC, one sufficiently strong enough that we converted one of our remaining Mellon uses over to OIDC. The advantage is that OIDC is easier to test if you're building a new version of your web server under another name.

Suppose that you're (re)building a version of your Apache based web server with authentication on, for example, a new version of Ubuntu, using a test server name. You want to test that everything still works before you deploy it, including your authentication. If you're using Mellon, as far as I can see you have to generate an entirely new SP configuration using your test server's name and then load it into your SAML IdP. You can't use your existing SAML SP configuration from your existing web server, because it specifies the exact URL the SAML IdP needs to use for various parts of the SAML protocol, and of course those URLs point to your production web server under its production name. As far as I know, to get another set of URLs that point to your test server, you need to set up an entirely new SP configuration.

OIDC has an equivalent thing in its redirect URI, but the OIDC redirect URL works somewhat differently. OIDC identity providers typically allow you to list multiple allowed redirect URIs for a given OIDC client, and it's the client that tells the server what redirect URI to use during authentication. So when you need to test your new server build under a different name, you don't need to register a new OIDC client; you can just add some more redirect URIs to your existing production OIDC client registration to allow your new test server to provide its own redirect URI. In the OpenIDC module, this will typically require no Apache configuration changes at all (from the production version), as the module automatically uses the current virtual host as the host for the redirect URI. This makes testing rather easier in practice, and it also generally tests the Apache OIDC configuration you'll use in production, instead of a changed version of it.

(You can put a hostname in the Apache OIDCRedirectURI directive, but it's simpler to not do so. Even if you did use a full URL in this, that's a single change in a text file.)

(One comment.)

The HTTP status codes of responses from about 22 hours of traffic to here (part 2)

Chris's Wiki :: blog

By: cks

3 May 2025 at 03:09

A few months ago, I wrote an entry about this topic, because I'd started putting in some blocks against crawlers, including things that claimed to be old versions of browsers, and I'd also started rate-limiting syndication feed fetching. Unfortunately, my rules at the time were flawed, rejecting a lot of people that I actually wanted to accept. So here are some revised numbers from today, a day when my logs suggest that I've seen what I'd call broadly typical traffic and traffic levels.

I'll start with the overall numbers (for HTTP status codes) for all requests:

  10592 403		[26.6%]
   9872 304		[24.8%]
   9388 429		[23.6%]
   8037 200		[20.2%]
   1629 302		[ 4.1%]
    114 301
     47 404
      2 400
      2 206

This is a much more balanced picture of activity than the last time around, with a lot less of the overall traffic being HTTP 403s. The HTTP 403s are from aggressive blocks, the HTTP 304s and HTTP 429s are mostly from syndication feed fetchers, and the HTTP 302s are mostly from things with various flaws that I redirect to informative static pages instead of giving HTTP 403s. The two HTTP 206s were from Facebook's 'externalhit' agent on a recent entry. A disturbing amount of the HTTP 403s were from Bing's crawler and almost 500 of them were from something claiming to be an Akkoma Fediverse server. 8.5% of the HTTP 403s were from something using Go's default User-Agent string.

The most popular User-Agent strings today for successful requests (of anything) were for versions of NetNewsWire, FreshRSS, and Miniflux, then Googlebot and Applebot, and then Chrome 130 on 'Windows NT 10'. Although I haven't checked, I assume that all of the first three were for syndication feeds specifically, with few or no fetches of other things. Meanwhile, Googlebot and Applebot can only fetch regular pages; they're blocked from syndication feeds.

The picture for syndication feeds looks like this:

   9923 304		[42%]
   9535 429		[40%]
   1984 403		[ 8.5%]
   1600 200		[ 6.8%]
    301 302
     34 301
      1 404

On the one hand it's nice that 42% of syndication feed fetches successfully did a conditional GET. On the other hand, it's not nice that 40% of them got rate-limited, or that there were clearly more explicitly blocked requests that there were HTTP 200 responses. On the sort of good side, 37% of the blocked feed fetches were from one IP that's using "Go-http-client/1.1" as its User-Agent (and which accounts for 80% of the blocks of that). This time around, about 58% of the requests were for my syndication feed, which is better than it was before but still not great.

These days, if certain problems are detected in a request I redirect the request to a static page about the problem. This gives me some indication of how often these issues are detected, although crawlers may be re-visiting the pages on their own (I can't tell). Today's breakdown of this is roughly:

   78%  too-old browser
   13%  too generic a User-Agent
    9%  unexpectedly using HTTP/1.0

There were slightly more HTTP 302 responses from requests to here than there were requests for these static pages, so I suspect that not everything that gets these redirects follows them (or at least doesn't bother re-fetching the static page).

I hope that the better balance in HTTP status codes here is a sign that I have my blocks in a better state than I did a couple of months ago. It would be even better if the bad crawlers would go away, but there's little sign of that happening any time soon.

(4 comments.)

Chrome and the burden of developing a browser

Chris's Wiki :: blog

By: cks

25 April 2025 at 02:53

One part of the news of the time interval is that the US courts may require Google to spin off Chrome (cf). Over on the Fediverse, I felt this wasn't a good thing:

I have to reluctantly agree that separating Chrome from Google would probably go very badly¹. Browsers are very valuable but also very expensive public goods, and our track record of funding and organizing them as such in a way to not wind up captive to something is pretty bad (see: Mozilla, which is at best questionable on this). Google is not ideal but at least Chrome is mostly a sideline, not a main hustle.

¹ <Lauren Weinstein Fediverse post> [...]

One possible reaction to this is that it would be good for everyone if people stopped spending so much money on browsers and so everything involving them slowed down. Unfortunately, I don't think that this would work out the way people want, because popular browsers are costly beasts. To quote what I said on the Fediverse:

I suspect that the cost of simply keeping the lights on in a modern browser is probably on the order of plural millions of dollars a year. This is not implementing new things, this is fixing bugs, keeping up with security issues, monitoring CAs, and keeping the development, CI, testing, and update infrastructure running. This has costs for people, for servers, and for bandwidth.

The reality of the modern Internet is that browsers are load bearing infrastructure; a huge amount of things run through them, including and especially on minority platforms. Among other things, no browser is 'secure' and all of them are constantly under attack. We want browser projects that are used by lots of people to have enough resources (in people, build infrastructure, update servers, and so on) to be able to rapidly push out security updates. All browsers need a security team and any browser with addons (which should be all of them) needs a security team for monitoring and dealing with addons too.

(Browsers are also the people who keep Certificate Authorities honest, and Chrome is very important in this because of how many people use it.)

On the whole, it's a good thing for the web that Chrome is in the hands of an organization that can spend tens of millions of dollars a year on maintaining it without having to directly monetize it in some way. It would be better if we could collectively fund browsers as the public good that they are without having corporations in the way, because Google absolutely corrupts Chrome (also) and Mozilla has stumbled spectacularly (more than once). But we have to deal with the world that we have, not the world that we'd like to have, and in this world no government seems to be interested in seriously funding obvious Internet public goods (not only browsers but also, for example, free TLS Certificate Authorities).

(It's not obvious that a government funded browser would come out better overall, but at least there would be a chance of something different than the narrowing status quo.)

PS: Another reason that spending on browsers might not drop is that Apple (with Safari) and Microsoft (with Edge) are also in the picture. Both of these companies might take the opportunity to slow down, or they might decide that Chrome's potentially weak new position was a good moment to push for greater dominance and maybe lock-in through feature leads.

(One comment.)

The appeal of serving your web pages with a single process

Chris's Wiki :: blog

By: cks

18 April 2025 at 02:58

As I slowly work on updating the software behind this blog to deal with the unfortunate realities of the modern web (also), I've found myself thinking (more than once) how much simpler my life would be if I was serving everything through a single process, instead of my eccentric, more or less stateless CGI-based approach. The simple great thing about doing everything through a single process (with threads, goroutines, or whatever inside it for concurrency) is that you have all the shared state you could ever want, and that shared state makes it so easy to do so many things.

Do you have people hitting one URL too often from a single IP address? That's easy to detect, track, and return HTTP 429 responses for until they cool down. Do you have an IP making too many requests across your entire site? You can track that sort of volume information. There's all sorts of potential bad stuff that it's at least easier to detect when you have easy shared global state. And the other side of this is that it's also relatively easy to add simple brute force caching in a single process with global state.

(Of course you have some practical concerns about memory and CPU usage, depending on how much stuff you're keeping track of and for how long.)

You can do a certain amount of this detection with a separate 'database' process of some sort (or a database file, like sqlite), and there's various specialized software that will let you keep this sort of data in memory (instead of on disk) and interact with it easily. But this is an extra layer or two of overhead over simply updating things in your own process, especially if you have to set up things like a database schema for what you're tracking or caching.

(It's my view that ease of implementation is especially useful when you're not sure what sort of anti-abuse measures are going to be useful. The easier it is to implement something and at least get logs of what and how much it would have done, the more you're going to try and the more likely you are to hit on something that works for you.)

Unfortunately it seems like we're only going to need more of this kind of thing in our immediate future. I don't expect the level of crawling and abuse to go down any time soon; if anything, I expect it to keep going up, especially as more and more websites move behind effective but heavyweight precautions and the crawlers turn more of their attention to the rest of us.

(One comment.)

Mandatory short duration TLS certificates are probably coming soon

Chris's Wiki :: blog

By: cks

13 April 2025 at 02:56

The news of the time interval is that the maximum validity period for TLS certificates will be lowered to 47 days by March 2029, unless the CA/Browser Forum changes its mind (or is forced to) before then. The details are discussed in SC-081. In skimming the mailing list thread on the votes, a number of organizations that voted to abstain seem unenthused (and uncertain that it can actually be implemented), so this may not come to pass, especially on the timeline proposed here.

If and when this comes to pass, I feel confident that this will end manual certificate renewals at places that are still doing them. With that, it will effectively end Certificate Authorities that don't have an API that you can automatically get certificates through (not necessarily a free or public API). I'm not sure what it's going to do to the Certificate Authority business models for commercial CAs, but I also don't think the browsers care about that issue and the browsers are driving.

This will certainly cause pain. I know of places around the university that are still manually handling one-year TLS certificates; those places will have to change over the course of a few years. This pain will arrive well before 2029; based on the proposed changes, starting March 15, 2027, the maximum certificate validity period will be 100 days, which is short enough to be decidedly annoying. Even a ~~250~~ 200 day validity period (starting March 15 2026) will be somewhat painful to do by hand.

I expect one consequence to be that some number of (internal) devices stop having valid TLS certificates, because they can only have certificates loaded into them manually and no one is going to do that every 40-dd or even every 90-odd days. You might manually get and load a valid TLS certificate every year; you certainly won't do it every three months (well, almost no one will).

I hope that this will encourage the creation and growth of more alternatives to Let's Encrypt, even if not all of them are free, since more and more CAs will be pushed to have an API and one obvious API to adopt is ACME.

(I can also imagine ways to charge for an ACME based API, even with standard ACME clients. One obvious way would be to only accept ACME requests for domains that the CA had some sort of site license with. You'd establish the site license through out of band means, not ACME.)

(2 comments.)

Launching BSSG - My Journey from Dynamic CMS to Bash Static Site Generator

IT+Notes

By: Stefano Marinelli

7 April 2025 at 06:11

Announcing the public release of BSSG, a Bash Static Site Generator born from a personal journey away from complex dynamic CMS. Discover a simple, portable alternative for your blog.

FediMeteo: How a Tiny €4 FreeBSD VPS Became a Global Weather Service for Thousands

IT+Notes

By: Stefano Marinelli

26 February 2025 at 06:00

How a simple idea turned into an international weather service on the Fediverse.

Caching snac Proxied Media with Nginx

IT+Notes

By: Stefano Marinelli

8 February 2025 at 15:00

Using nginx to cache proxied media in snac to improve performance and privacy

Improving snac Performance with Nginx Proxy Cache

IT+Notes

By: Stefano Marinelli

29 January 2025 at 08:00

Using nginx to cache snac multimedia files to boost performance

Trapping Misbehaving Bots in an A.I. Labyrinth

Pixel Envy

By: Nick Heer

22 March 2025 at 04:32

Reid Tatoris, Harsh Saxena, and Luis Miglietti, of Cloudflare:

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

Two thoughts:

This is amusing. Nothing funnier than using someone’s own words or, in this case, technology against them.
This is surely going to lead to the same arms race as exists now between privacy protections and hostile adtech firms. Right?

⌥ Permalink

Doing multi-tag matching through URLs on the modern web

Chris's Wiki :: blog

By: cks

14 March 2025 at 02:46

So what happened is that Mike Hoye had a question about a perfectly reasonable ideas:

Question: is there wiki software out there that handles tags (date, word) with a reasonably graceful URL approach?

As in, site/wiki/2020/01 would give me all the pages tagged as 2020 and 01, site/wiki/foo/bar would give me a list of articles tagged foo and bar.

I got nerd-sniped by a side question but then, because I'd been nerd-sniped, I started thinking about the whole thing and it got more and more hair-raising as a thing done in practice.

This isn't because the idea of stacking selections like this is bad; 'site/wiki/foo/bar' is a perfectly reasonable and good way to express 'a list of articles tagged foo and bar'. Instead, it's because of how everything on the modern web eventually gets visited combined with how, in the natural state of this feature, 'site/wiki/bar/foo' is just a valid a URL for 'articles tagged both foo and bar'.

The combination, plus the increasing tendency of things on the modern web to rattle every available doorknob just to see what happens, means that even if you don't advertise 'bar/foo', sooner or later things are going to try it. And if you do make the combinations discoverable through HTML links, crawlers will find them very fast. At a minimum this means crawlers will see a lot of essentially duplicated content, and you'll have to go through all of the work to do the searches and generate the page listings and so on.

If I was going to implement something like this, I would define a canonical tag order and then, as early in request processing as possible, generate a HTTP redirect from any non-canonical ordering to the canonical one. I wouldn't bother checking if the tags were existed or anything, just determine that they are tags, put them in canonical order, and if the request order wasn't canonical, redirect. That way at least all of your work (and all of the crawler attention) is directed at one canonical version. Smart crawlers will notice that this is a redirect to something they already have (and hopefully not re-request it), and you can more easily use caching.

(And if search engines still matter, the search engines will see only your canonical version.)

This probably holds just as true for doing this sort of tag search through query parameters on GET queries; if you expose the result in a URL, you want to canonicalize it. However, GET query parameters are probably somewhat safer if you force people to form them manually and don't expose links to them. So far, web crawlers seem less likely to monkey around with query parameters than with URLs, based on my limited experience with the blog.

Some views on the common Apache modules for SAML or OIDC authentication

Chris's Wiki :: blog

By: cks

12 March 2025 at 03:01

Suppose that you want to restrict access to parts of your Apache based website but you want something more sophisticated and modern than Apache Basic HTTP authentication. The traditional reason for this was to support 'single sign on' across all your (internal) websites; the modern reason is that a central authentication server is the easiest place to add full multi-factor authentication. The two dominant protocols for this are SAML and OIDC. There are commonly available Apache authentication modules for both protocols, in the form of Mellon (also) for SAML and OpenIDC for OIDC.

I've now used or at least tested the Ubuntu 24.04 version of both modules against the same SAML/OIDC identity provider, primarily because when you're setting up a SAML/OIDC IdP you need to be able to test it with something. Both modules work fine, but after my experiences I'm more likely to use OpenIDC than Mellon in most situations.

Mellon has two drawbacks and two potential advantages. The first drawback is that setting up a Mellon client ('SP') is more involved. Most of annoying stuff is automated for you with the mellon_create_metadata script (which you can get from the Mellon repository if it's not in your Mellon package), but you still have to give your IdP your XML blob and get their XML blob. The other drawback is that Mellon isn't integrated into the Apache 'Require' framework for authorization decisions; instead you have to make do with Mellon-specific directives.

The first potential advantage is that Mellon has a straightforward story for protecting two different areas of your website with two different IdPs, if you need to do that for some reason; you can just configure them in separate <Location> or <Directory> blocks and everything works out. If anything, it's a bit non-obvious how to protect various disconnected bits of your URL space with the same IdP without having to configure multiple SPs, one for each protected section of URL space. The second potential advantage is that in general SAML has an easier story for your IdP giving you random information, and Mellon will happily export every SAML attribute it gets into the environment your CGI or web application gets.

The first advantage of OpenIDC is that it's straightforward to configure when you have a single IdP, with no XML and generally low complexity. It's also straightforward to protect multiple disconnected URL areas with the same IdP but possibly different access restrictions. A third advantage is that OpenIDC is integrated into Apache's 'Require' system, although you have to use OpenIDC specific syntax like 'Require claim groups:agroup' (see the OpenIDC wiki on authorization).

In exchange for this, it seems to be quite involved to use OpenIDC if you need to use multiple OIDC identity providers to protect different bits of your website. It's apparently possible to do this in the same virtual host but it seems quite complex and requires a lot of parts, so if I was confronted with this problem I would try very hard to confine each web thing that needed a different IdP into a different virtual host. And OpenIDC has the general OIDC problem that it's harder to expose random information.

(All of the important OpenIDC Apache directives about picking an IdP can't be put in <Location> or <Directory> blocks, only in a virtual host as a whole. If you care about this, see the wiki on Multiple Providers and also access to different URL paths on a per-provider basis.)

We're very likely to only ever be working with a single IdP, so for us OpenIDC is likely to be easier, although not hugely so.

Sidebar: The easy approach for group based access control with either

Both Mellon and OpenIDC work fine together with the traditional Apache AuthGroupFile directive, provided (of course) that you have or build an Apache format group file using what you've told Mellon or OpenIDC to use as the 'user' for Apache authentication. If your IdP is using the same user (and group) information as your regular system is, then you may well already have this information around.

(This is especially likely if you're migrating from Apache Basic HTTP authentication, where you already needed to build this sort of stuff.)

Building your own Apache group file has the additional benefit that you can augment and manipulate group information in ways that might not fit well into your IdP. Your IdP has the drawback that it has to be general; your generated Apache group file can be narrowly specific for the needs of a particular web area.

(One comment.)

The web browser as an enabler of minority platforms

Chris's Wiki :: blog

By: cks

11 March 2025 at 03:35

Recently, I got involved in a discussion on the Fediverse over what I will simplify to the desirability (or lack of it) of cross platform toolkits, including the browser, and how they erase platform personality and opinions. This caused me to have a realization about what web browser based applications are doing for me, which is that being browser based is what lets me use them at all.

My environment is pretty far from being a significant platform; I think Unix desktop share is in the low single percent under the best of circumstances. If people had to develop platform specific versions of things like Grafana (which is a great application), they'd probably exist for Windows, maybe macOS, and at the outside, tablets (some applications would definitely exist on phones, but Grafana is a bit of a stretch). They probably wouldn't exist on Linux, especially not for free.

That the web browser is a cross platform environment means that I get these applications (including the Fediverse itself) essentially 'for free' (which is to say, it's because of the efforts of web browsers to support my platform and then give me their work for free). Developers of web applications don't have to do anything to make them work for me, not even so far as making it possible to build their software on Linux; it just happens for them without them even having to think about it.

Although I don't work in the browser as much as some people do, looking back the existence of implicitly cross platform web applications has been a reasonably important thing in letting me stick with Linux.

This applies to any minority platform, not just Linux. All you need is a sufficiently capable browser and you have access to a huge range of (web) applications.

(Getting that sufficiently capable browser can be a challenge on a sufficiently minority platform, especially if you're not on a major architecture. I'm lucky in that x86 Linux is a majority minority platform; people on FreeBSD or people on architectures other than x86 and 64-bit ARM may be less happy with the situation.)

PS: I don't know if what we have used the web for really counts as 'applications', since they're mostly HTML form based things once you peel a few covers off. But if they do count, the web has been critical in letting us provide them to people. We definitely couldn't have built local application versions of them for all of the platforms that people here use.

(I'm sure this isn't a novel thought, but the realization struck (or re-struck) me recently so I'm writing it down.)

(3 comments.)

HTTP connections are part of the web's long tail

Chris's Wiki :: blog

By: cks

22 February 2025 at 03:32

I recently read an article that, among other things, apparently seriously urging browser vendors to deprecate and disable plain text HTTP connections by the end of October of this year (via, and I'm deliberately not linking directly to the article). While I am a strong fan of HTTPS in general, I have some feelings about a rapid deprecation of HTTP. One of my views is that plain text HTTP is part of the web's long tail.

As I'm using the term here, the web's long tail (also is the huge mass of less popular things that are individually less frequently visited but which in aggregate amount to a substantial part of the web. The web's popular, busy sites are frequently updated and can handle transitions without problems. They can readily switch to using modern HTML, modern CSS, modern JavaScript, and so on (although they don't necessarily do so), and along with that update all of their content to HTTPS. In fact they mostly or entirely have done so over the last ten to fifteen years. The web's long tail doesn't work like that. Parts of it use old JavaScript, old CSS, old HTML, and these days, plain HTTP (in addition to the people who have objections to HTTPS and deliberately stick to HTTP).

The aggregate size and value of the long tail is part of why browsers have maintained painstaking compatibility back to old HTML so far, including things like HTML Image Maps. There's plenty of parts of the long tail that will never be updated to have HTTPS or work properly with it. For browsers to discard HTTP anyway would be to discard that part of the long tail, which would be a striking break with browser tradition. I don't think this is very likely and I certainly hope that it never comes to pass, because that long tail is part of what gives the web its value.

(It would be an especially striking break since a visible percentage of page loads still happen with HTTP instead of HTTPS. For example, Google's stats say that globally 5% of Windows Chrome page loads apparently still use HTTP. That's roughly one in twenty page loads, and the absolute number is going to be very large given how many page loads happen with Chrome on Windows. This large number is one reason I don't think this is at all a serious proposal; as usual with this sort of thing, it ignores that social problems are the ones that matter.)

PS: Of course, not all of the HTTP connections are part of the web's long tail as such. Some of them are to, for example, manage local devices via little built in web servers that simply don't have HTTPS. The people with these devices aren't in any rush to replace them just because some people don't like HTTP, and the vendors who made them aren't going to update their software to support (modern) HTTPS even for the devices which support firmware updates and where the vendor is still in business.

(You can view them as part of the long tail of 'the web' as a broad idea and interface, even though they're not exposed to the world the way that the (public) web is.)

(3 comments.)

The HTTP status codes of responses from about 21 hours of traffic to here

Chris's Wiki :: blog

By: cks

17 February 2025 at 04:06

You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting more strict about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests to here got in the past roughly 21 hours and a bit. I think this is about typical, although there may be more blocked things than usual.

I'll start with the overall numbers for all requests:

 22792 403      [45%]
  9207 304      [18.3%]
  9055 200      [17.9%]
  8641 429      [17.1%]
   518 301
    58 400
    33 404
     2 206
     1 302

HTTP 403 is the error code that people get on blocked access; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked access is major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between HTTP 403s and HTTP 429s, 62% or so of the requests overall were rejected and only 36% got a useful reply.

(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)

The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:

  9136 304    [39.5%]
  8641 429    [37.4%]
  3614 403    [15.6%]
  1663 200    [ 7.2%]
    19 301

Some of those rejections are for major web crawlers and almost a thousand are for a pair of prolific, repeat high volume request sources, but a lot of them aren't. Feed requests account for 23073 requests out of a total of 50307, or about 45% of the requests. To me this feels quite low for anything plausibly originated from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.

(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and out of them 44.4% were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. 46.9% made successful conditional GET requests (ones that got a HTTP 304 response) and 8.5% actually fetched feed data.)

DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:

The most blocked alternate views are:

  1589 ?writecomment
  1336 ?normal
  1309 ?source
   917 ?showcomments

(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)

If I look only at plain requests, not requests for syndication feeds or alternate views, I see:

 13679 403   [64.5%]
  6882 200   [32.4%]
   460 301
    68 304
    58 400
    33 404
     2 206
     1 302

This means the breakdown of traffic is 21183 normal requests (42%), 45% feed requests, and the remainder for alternate views, almost all of which were rejected.

Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:

  7116 Forged Chrome/129.0.0.0 User-Agent
  1451 Bingbot
  1173 Forged Chrome/121.0.0.0 User-Agent
   930 PerplexityBot ('AI' LLM data crawler)
   915 Blocked sources using a 'Go-http-client/1.1' User-Agent

Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence, and it's possible that part of what is going on is that I have mis-firing blocking rules (alternately, I'm being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source). However, some of the ASNs that show up highly are definitely ones I recognize from other contexts, such as attempted comment spam.

Update: Well that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been so forged (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions that were temporarily unable to read the blog because of my overly-aggressive blocks.

(4 comments.)

Web application design and the question of what is a "route"

Chris's Wiki :: blog

By: cks

8 February 2025 at 04:16

So what happened is that Leah Neukirchen ran a Fediverse poll on how many routes your most complex web app had, and I said that I wasn't going to try to count how many DWiki had and then gave an example of combining two things in a way that I felt was a 'route' (partly because 'I'm still optimizing the router' was one poll answer). This resulted in a discussion where one of the questions I draw from it is "what is a route, exactly".

At one level counting up routes in your web application seems simple. For instance, in our Django application I could count up the URL patterns listed in our 'urlpatterns' setting (which gives me a larger number than I expected for what I think of as a simple Django application). Pattern delegation may make this a bit tedious, but it's entirely tractable. However, I think that this only works for certain sorts of web applications that are designed in a particular way, and as it happens I have an excellent example of where the concept of "route" gets fuzzy.

DWiki, the engine behind this blog, is actually a general filesystem based wiki (engine). As a filesystem based wiki, what it started out doing was to map any URL path to a filesystem object and then render the filesystem object in some appropriate way; for example, directories turn into a listing of their contents. With some hand-waving you could say that this is one route, or two once we through in an optional system for handling static assets. Alternately you could argue that this is two (or three) routes, one route for directories and one route for files, because the two are rendered differently (although that's actually implemented in templates, not in code, so maybe they're one route after all).

Later I added virtual directories, which are added to the end of directory paths and are used to restrict what things are visible within the directory (or directory tree). Both the URL paths involved and the actual matching against them look like normal routing (although they're not handled through a traditional router approach), so I should probably count them as "routes", adding four or so more routes, so you could say that DWiki has somewhere between five and seven routes (if you count files and directories separately and throw in a third route for static asset files).

However, I've left out a significant detail, which is visible in how both the blog's front page and the Atom syndication feed of the blog use the same path in their URLs, and the blog's front page looks nothing like a regular directory listing. What's going on is that how DWiki presents both files and especially directories depends on the view they're shown in, and DWiki has a bunch of views; all of the above differences are because of different views being used. Standard blog entry files can be presented in (if I'm counting right) five different views. Directories have a whole menagerie of views that they support, including a 'blog' view. Because views are alternate presentations of a given filesystem object and thus URL path, they're provided as a query parameter, not as part of the URL's path.

Are DWiki's views routes, and if they are, how do we count them? Is each unique combination of a page type (including virtual directories) and a view a new route? One thing that may affect your opinion of this is that a lot of the implementation of views is actually handled in DWiki's extremely baroque templates, not code. However, DWiki's code knows a full list of what views exist (and templates have to be provided or you'll get various failures).

(I've also left out a certain amount of complications, like redirections and invalid page names.)

The broad moral I draw from this exercise is that the model of distinct 'routes' is one that only works for certain sorts of web application design. When and where it works well, it's a quite useful model and I think it pushes you toward making good decisions about how to structure your URLs. But in any strong form, it's not a universal pattern and there are ways to go well outside it.

(Interested parties can see a somewhat out of date version of DWiki's code and many templates, although note that both contain horrors. At some point I'll probably update both to reflect my recent burst of hacking on DWiki.)

(2 comments.)

Web spiders (or people) can invent unfortunate URLs for your website

Chris's Wiki :: blog

By: cks

3 February 2025 at 00:55

Let's start with my Fediverse post:

Today in "spiders on the Internet do crazy things": my techblog lets you ask for a range of entries. Normally the range that people ask for is, say, ten entries (the default, which is what you normally get links for). Some deranged spider out there decided to ask for a thousand entries at once and my blog engine sighed, rolled up its sleeves, and delivered (slowly and at large volume).

In related news, my blog engine can now restrict how large a range people can ask for (although it's a hack).

DWiki is the general wiki engine that creates Wandering Thoughts. As part of its generality, it has a feature that shows a range of 'pages' (in Wandering Thoughts these are entries, in general these are files in a directory tree), through what I call virtual directories. As is usual with these things, the range of entries (pages, files) that you're asking for is specified in the URL, with syntax like '<whatever>/range/20-30'.

If you visit the blog front page or similar things, the obvious and discoverable range links you get are for ten entries. You can under some situations get links for slightly bigger ranges, but not substantially larger ones. However, the engine didn't particularly restrict the size of these ranges, so if you wanted to create URLs by hand you could ask for very large ranges.

Today, I discovered that two IPs had asked for 1000-entry ranges today, and the blog engine provided them. Based on some additional log information, it looks like it's not the first time that giant ranges have been requested. One of those IPs was an AWS IP, for which my default assumption is that this is a web spider of some source. Even if it's not a conventional web spider, I doubt anyone is asking for a thousand entries at once with the plan of reading them all; that's a huge amount of text, so it's most likely being done to harvest a lot of my entries at once for some purpose.

(Partly because of that and partly because it puts a big load on DWiki, I've now hacked in a mentioned feature to restrict how large a range you can request. Because it's a hack, too-large ranges get HTTP 404 responses instead of something more useful.)

Sidebar: on the "virtual directories" name and feature

All of DWiki's blog parts are alternate views of a directory hierarchy full of files, where each file is a 'page' and in the context of Wandering Thoughts, almost all pages are blog entries (on the web, the 'See as Normal' link at the bottom will show you the actual directory view of something). A 'virtual directory' is a virtual version of the underlying real directory or directory hierarchy that only shows some pages, for example pages from 2025 or a range of pages based on how recent they are.

All of this is a collection of hacks built on top of other hacks, because that's what happens when you start with a file based wiki engine and decide you can make it be a blog too with only a few little extra features (as a spoiler, it did not wind up requiring only a few extra things). For example, you might wonder how the blog's front page winds up being viewed as a chronological blog, instead of a directory, and the answer is a hack.

(2 comments.)

Some learning experiences with HTTP cookies in practice

Chris's Wiki :: blog

By: cks

27 January 2025 at 03:29

Suppose, not hypothetically, that you have a dynamic web site that makes minor use of HTTP cookies in a way that varies the output, and also this site has a caching layer. Naturally you need your caching layer to only serve 'standard' requests from cache, not requests that should get something non-standard. One obvious and simple approach is to skip your cache layer for any request that has a HTTP cookie. If you (I) do this, I have bad news about HTTP requests in practice, at least for syndication feed fetchers.

(One thing you might do with HTTP cookies is deliberately bypass your own cache, for example to insure that someone who posts a new comment can immediately see their own comment, even if an older version of the page is in the cache.)

The thing about HTTP cookies is that the HTTP client can send you anything it likes as a HTTP cookie and unfortunately some clients will. For example, one feed reader fetcher deliberately attempts to bypass Varnish caches by sending a cookie with all fetch requests, so if the presence of any HTTP cookie causes you to skip your own cache (and other things you do that use the same logic), well, feeder.co is bypassing your caching layer too. Another thing that happens is that some syndication feed fetching clients appear to sometimes leak unrelated cookies into their HTTP requests.

(And of course if your software is hosted along side other software that might set unrestricted cookies for the entire website, those cookies may leak into requests made to your software. For feed fetching specifically, this is probably most likely in feed readers that are browser addons.)

The other little gotcha is that you shouldn't rely on merely the presence or absence of a 'Cookie:' header in the request to tell you if the request has cookies, because a certain number of HTTP clients appear to send a blank Cookie: header (ie, just 'Cookie:'). You might be doing this directly in a CGI by checking for the presence of $HTTP_COOKIE, or you might be doing this indirectly by parsing any Cookie: header in the request into a 'Cookies' object of some sort (even if the value is blank), in which case you'll wind up with an empty Cookies object.

(You can also receive cookies with a blank value in a Cookies: header, eg 'JSESSIONID=', which appears to be a deliberate decision by the software involved, and seems to be to deal with a bad feed source.)

If you actually care about all of this, as I do now that I've discovered it all, you'll want to specifically check for the presence of your own cookies and ignore any other cookies you see, as well as a blank 'Cookie:' HTTP header. Doing extra special things if you see a 'bypass_varnish=1' cookie is up to you.

(In theory I knew that the HTTP Cookies: header was untrusted client data and shouldn't be trusted, and sometimes even contained bad garbage (which got noted every so often in my logs). In practice I didn't think about the implications of that for some of my own code until now.)

Syndication feeds here are now rate-limited on a per-IP basis

Chris's Wiki :: blog

By: cks

26 January 2025 at 03:30

For a long time I didn't look very much at the server traffic logs for Wandering Thoughts, including what was fetching my syndication feeds and how, partly because I knew that looking at web server logs invariably turns over a rock or two. In the past few months I started looking at my feed logs, and then I spent some time trying to get some high traffic sources to slow down on an ad-hoc basis, which didn't have much success (partly because browser feed reader addons seem bad at this). Today I finally gave in to temptation and added general per-IP rate limiting for feed requests. A single IP that requests a particular syndication feed too soon after its last successful request will receive a HTTP 429 response.

(The actual implementation is a hack, which is one reason I didn't do it before now; DWiki, the engine behind Wandering Thoughts, doesn't have an easy place for dynamically updated shared state.)

This rate-limiting will probably only moderately reduce the load on Wandering Thoughts, for various reasons, but it will make me happier. I'm also looking forward to having a better picture of what I consider 'actual traffic' to Wandering Thoughts, including actual User-Agent usage, without the distortions added by badly behaved browser addons (I'm pretty sure that my casual view of Firefox's popularity for visitors has been significantly distorted by syndication feed over-fetching).

In applying this rate limiting, I've deliberately decided not to exempt various feed reader providers like NewsBlur, Feedbin, Feedly, and so on. Hopefully all of these places will react properly to receiving periodic HTTP 429 requests and not, say, entirely give up fetching my feeds after a while because they're experiencing 'too many errors'. However, time will tell if this is correct (and if my HTTP 429 responses cause them to slow down their often quite frequent syndication feed requests).

In general I'm going to have to see how things develop, and that's a decent part of why I'm doing this at all. I'm genuinely curious how clients will change their behavior (if they do) and what will emerge, so I'm doing a little experiment (one that's nowhere as serious and careful as rachelbythebay's ongoing work).

PS: The actual rate limiting applies a much higher minimum interval for unconditional HTTP syndication feed requests than for conditional ones, for the usual reason that I feel repeated unconditional requests for syndication feeds is rather antisocial, and if a feed fetcher is going to be antisocial I'm not going to talk to it very often.

(7 comments.)

More features for web page generation systems doing URL remapping

Chris's Wiki :: blog

By: cks

23 January 2025 at 04:08

A few years ago I wrote about how web page generation systems should support remapping external URLs (this includes systems that convert some form of wikitext to HTML). At the time I was mostly thinking about remapping single URLs and mentioned things like remapping prefixes (so you could remap an entire domain into web.archive.org) as something for a fancier version. Well, the world turns and things happen and I now think that such prefix remapping is essential; even if you don't start out with it, you're going to wind up with it in the longer term.

(To put it one way, the reality of modern life is that sometimes you no longer want to be associated with some places. And some day, my Fediverse presence may also move.)

In light of a couple of years of churn in my website landscape (after what was in hindsight a long period of stability), I now have revised views on the features I want in a (still theoretical) URL remapping system for Wandering Thoughts. The system I want should be able to remap individual URLs, entire prefixes, and perhaps regular expressions with full scale rewrites (or maybe some scheme with wildcard matching), although I don't currently have a use for full scale regular expression rewrites. As part of this, there needs to be some kind of priority or hierarchy between different remappings that can all potentially match the same URL, because there's definitely at least one case today where I want to remap 'asite/a/*' somewhere and all other 'asite/*' URLs to something else. While it's tempting to do something like 'most specific thing matches', working out what is most specific from a collection of different sorts of remapping rules seems a bit hard, so I'd probably just implement it as 'first match wins' and manage things by ordering matches in the configuration file.

('Most specific match wins' is a common feature in web application frameworks for various reasons, but I think it's harder to implement here, especially if I allow arbitrary regular expression matches.)

Obviously the remapping configuration file should support comments (every configuration system needs to). Less obviously, I'd support file inclusion or the now common pattern of a '<whatever>.d' directory for drop in files, so that remapping rules can be split up by things like the original domain rather than having to all be dumped into an ever-growing single configuration file.

(Since more and more links rot as time passes, we can pretty much guarantee that the number of our remappings is going to keep growing.)

Along with the remapping, I may want something (ie, a tiny web application) that dynamically generates some form of 'we don't know where you can find this now but here is what the URL used to be' page for any URL I feed it. The obvious general reason for this is that sometimes old domain names get taken over by malicious parties and the old content is nowhere to be found, not even on web.archive.org. In that case you don't want to keep a link to what's now a malicious site, but you also don't have any other valid target for your old link. You could rewrite the link to some invalid domain name and leave it to the person visiting you and following the link to work out what happened, but it's better to be friendly.

(This is where you want to be careful about XSS and other hazards of operating what is basically an open 'put text in and we generate a HTML page with it shown in some way' service.)

(One comment.)

The programmable web browser was and is inevitable

Chris's Wiki :: blog

By: cks

4 January 2025 at 03:40

In a comment on my entry on why the modern web is why web browsers can't have nice things, superkuh wrote in part:

In the past it was seen as crazy to open every executable file someone might send you over the internet (be it email, ftp, web, or whatever). But sometime in the 2010s it became not only acceptable, but standard practice to automatically run every executable sent to you by any random endpoint on the internet.

For 'every executable' you should read 'every piece of JavaScript', which is executable code that is run by your browser as a free and relatively unlimited service provided to every web page you visit. The dominant thing restraining the executables that web pages send you is the limited APIs that browsers provide, which is why they provide such limited APIs. This comment sparked a chain of thoughts that led to a thesis.

I believe that the programmable web browser was (and is) inevitable. I don't mean this just in the narrow sense that if it hadn't been JavaScript it would have been Flash or Java applets or Lua or WASM or some other relatively general purpose language that the browser would up providing. Instead, I mean it in a broad and general sense, because 'programmability' of the browser is driven by a general and real problem.

For almost as long as the web has existed, people have wanted to create web pages that had relatively complex features and interactions. They had excellent reasons for this; they wanted drop-down or fold-out menus to save screen space so that they could maximize the amount of space given to important stuff instead of navigation, and they wanted to interactively validate form contents before submission for fast feedback to the people filling them in, and so on. At the same time, browser developers didn't want to (and couldn't) program every single specific complex feature that web page authors wanted, complete with bespoke HTML markup for it and so on. To enable as many of these complex features as possible with as little work on their part as possible, browser developers created primitives that could be assembled together to create more sophisticated features, interactions, layouts, and so on.

When you have a collection of primitives that people are expected to use to create their specific features, interactions, and so on, you have a programming language and a programming environment. It doesn't really matter if this programming language is entirely declarative (and isn't necessarily Turing complete), as in the case of CSS; people have to program the web browser to get what they want.

So my view is that we were always going to wind up with at least one programming language in our web browsers, because a programming language is the meeting point between what web page authors want to have and what browser developers want to provide. The only question was (and is) how good of a programming language (or languages) we were going to get. Or perhaps an additional question was whether the people designing the 'programming language' were going to realize that they were doing so, or if they were going to create one through an accretion of features.

(My view is that CSS absolutely is a programming language in this sense, in that you must design and 'program' it in order to achieve the effects you want, especially if you want sophisticated ones like drop down menus. Modern CSS has thankfully moved beyond the days when I called it an assembly language.)

(This elaborates on a Fediverse post.)

(One comment.)

The modern web is why web browsers don't have "nice things" (platform APIs)

Chris's Wiki :: blog

By: cks

2 January 2025 at 04:00

Every so often I read something that says or suggests that the big combined browser and platform vendors (Google, Apple, and to a lesser extent Microsoft) have deliberately limited their browser's access to platform APIs that would put "progressive web applications" on par with native applications. While I don't necessarily want to say that these vendors are without sin, in my view this vastly misses the core reason web browsers have limited and slow moving access to platform APIs. To put it simply, it's because of what the modern web has turned into, namely "a hive of scum and villainy" to sort of quote a famous movie.

Any API the browser exposes to web pages is guaranteed to be used by bad actors, and this has been true for a long time. Bad actors will use these APIs to track people, to (try to) compromise their systems, to spy on them, or basically for anything that can make money or gain information. Many years ago I said this was why native applications weren't doomed and basically nothing has changed since then. In particular, browsers are no better at designing APIs that can't be abused or blocking web pages that abuse these APIs, and they probably never will be.

(One of the problems is the usual one in security; there are a lot more attackers than there are browser developers designing APIs, and the attackers only have to find one oversight or vulnerability. In effect attackers are endlessly ingenious while browser API designers have finite time they can spend if they want to ship anything.)

The result of this is that announcements of new browser APIs are greeted not with joy but with dread, because in practice they will mostly be yet another privacy exposure and threat vector (Chrome will often ship these APIs anyway because in practice as demonstrated by their actions, Google mostly doesn't care). Certainly there are some web sites and in-browser applications that will use them well, but generally they'll be vastly outnumbered by attackers that are exploiting these APIs. Browser vendors (even Google with Chrome) are well aware of these issues, which is part of why they create and ship so few APIs and often don't give them very much power.

(Even native APIs are increasingly restricted, especially on mobile devices, because there are similar issues on those. Every operating system vendor is more and more conscious of security issues and the exposures that are created for malicious applications.)

You might be tempted to say that the answer is forcing web pages to ask for permission to use these APIs. This is a terrible idea for at least two reasons. The first reason is alert (or question) fatigue; at a certain point this becomes overwhelming and people stop paying attention. The second reason is that people generally want to use websites that they're visiting, and if faced with a choice between denying a permission and being unable to use the website or granting the permission and being able to use the website, they will take the second choice a lot of the time.

(We can see both issues in effect in mobile applications, which have similar permissions requests and create similar permissions fatigue. And mobile applications ask for permissions far less often than web pages often would, because most people visit a lot more web pages than they install applications.)

(One comment.)

Thinking about how to tame the interaction of conditional GET and caching

Chris's Wiki :: blog

By: cks

21 November 2024 at 03:41

Due to how I do caching here, Wandering Thoughts has a long standing weird HTTP behavioral quirk where a non-conditional GET for a syndication feed here can get a different answer than a conditional GET. One (technical) way to explain this issue is that the cache validity interval for non-conditional GETs is longer than the cache validity interval for conditional GETs. In theory this could be the complete explanation of the issue, but in practice there's another part to it, which is that DWiki doesn't automatically insert responses into the cache on a cache miss.

(The cache is normally only filled for responses that were slow to generate, either due to load or because they're expensive. Otherwise I would rather dynamically generate the latest version of something and not clutter up cache space.)

There are various paths that I could take, but which ones I want to take depends on what my goals are and I'm actually not entirely certain about that. If my goal is to serve responses to unconditional GETs that are as fresh as possible but come from cache for as long as possible, what I should probably do is make conditional GETs update the cache when the cached version of the feed exists and would still have been served to an unconditional GET. I've already paid the cost to dynamically generate the feed, so I might as well serve it to unconditional GET requests. However, in my current cache architecture this would have the side effect of causing conditional GETs to get that newly updated cached copy for the conditional GET cache validity period, instead of generating the very latest feed dynamically (what would happen today).

(A sleazy approach would be to backdate the newly updated cache entry by the conditional GET validity interval. My current code architecture doesn't allow for that, so I can avoid the temptation.)

On the other hand, the entire reason I have a different (and longer) cache validity interval for unconditional GET requests is that in some sense I want to punish them. It's a deliberate feature that unconditional GETs receive stale responses, and in some sense the more stale the response the better. Even though updating the cache with a current response I've already generated is in some sense free, doing it cuts against this goal, both in general and in specific. In practice, Wandering Thoughts sees frequent enough conditional GETs for syndication feeds that making conditional GETs refresh the cached feed would effectively collapse the two cache validity intervals into one, which I can already do without any code changes. So if this is my main goal for cache handling of unconditional GETs of my syndication feed, the current state is probably fine and there's nothing to fix.

(A very approximate number is that about 15% of the syndication feed requests to Wandering Thoughts are unconditional GETs. Some of the offenders should definitely know and do better, such as 'Slackbot 1.0'.)

Syndication feed fetchers and their behavior on HTTP 429 status responses

Chris's Wiki :: blog

By: cks

11 November 2024 at 04:09

For reasons outside of the scope of this entry, recently I've been looking at the behavior of syndication feed fetchers here on Wandering Thoughts (which are generally from syndication feed readers), and in the process I discovered some that were making repeated requests at a quite aggressive rate, such as every five minutes. Until recently there was some excuse for this, because I wasn't setting a 'Cache-Control: max-age=...' header (also), which is (theoretically) used to tell Atom feed fetchers how soon they should re-fetch. I feel there was not much of an excuse because no feed reader should default to fetching every five minutes, or even every fifteen, but after I set my max-age to an hour there definitely should be no excuse.

Since sometimes I get irritated with people like this, I arranged to start replying to such aggressive feed featchers with a HTTP 429 "Too Many Requests" status response (the actual implementation is a hack because my entire software is more or less stateless, which makes true rate limiting hard). What I was hoping for is that most syndication feed fetching software would take this as a signal to slow down how often it tried to fetch the feed, and I'd see excessive sources move from one attempt every five minutes to (much) slower rates.

That basically didn't happen (perhaps this is no surprise). I'm sure there's good syndication feed fetching software that probably would behave that way on HTTP 429 responses, but whatever syndication feed software was poking me did not react that way. As far as I can tell from casually monitoring web access logs, almost no mis-behaving feed software paid any attention to the fact that it was specifically getting a response that normally means "you're doing this too fast". In some cases, it seems to have caused programs to try to fetch even more than before.

(Perhaps some of this is because I didn't add a 'Retry-After' header to my HTTP 429 responses until just now, but even without that, I'd expect clients to back off on their own, especially after they keep getting 429s when they retry.)

Given the HTTP User-Agents presented by feed fetchers, some of this is more or less expected, for two reasons. First, some of the User-Agents are almost certainly deliberate lies, and if a feed crawler is going to actively lie about what it is there's no reason for it to respect HTTP 429s either. Second, some of the feed fetching is being done by stateless programs like curl, where the people building ad-hoc feed fetching systems around them would have to go (well) out of their way to do the right thing. However, a bunch of the aggressive feed fetching is being done by either real feed fetching software with a real user-agent (such as "RSS Bot" or the Universal Feed Parser) or by what look like browser addons running in basically current versions of Firefox. I'd expect both of these to respect HTTP 429s if they're programmed decently. But then, if they were programmed decently they probably wouldn't be trying every five minutes in the first place.

(Hopefully the ongoing feed reader behavior project by rachelbythebay will fix some of this in the long run; there are encouraging signs, as covered in eg the October 25th score report.)

(One comment.)

Keeping your site accessible to old browsers is non-trivial

Chris's Wiki :: blog

By: cks

31 October 2024 at 03:13

One of the questions you could ask about whether or not to block HTTP/1.0 requests is what this does to old browsers and your site's accessibility to (or from) them (see eg the lobste.rs comments on my entry). The reason one might care about this is that old systems can usually only use old browsers, so to keep it possible to still use old systems you want to accommodate old browsers. Unfortunately the news there is not really great, and taking old browsers and old systems seriously has a lot of additional effects.

The first issue is that old systems generally can't handle modern TLS and don't recognize modern certificate authorities, like Let's Encrypt. This situation is only going to get worse over time, as websites increasingly require TLS 1.2 or better (and then in the future, TLS 1.3 or better). If you seriously care about keeping your site accessible to old browsers, you need to have a fully functional HTTP version. Increasingly, it seems that modern browsers won't like this, but so far they're willing to put up with it. I don't know if there's any good way to steer modern visitors to your HTTPS version instead of your HTTP version.

(This is one area where modern browsers preemptively trying HTTPS may help you.)

Next, old browsers obviously only support old versions of CSS, if they have very much CSS support at all (very old browsers probably won't). This can present a real conflict; you can have an increasingly basic site design that sticks within the bounds of what will render well on old browsers, or you can have one that looks good to what's probably the vast majority of your visitors and may or may not degrade gracefully on old browsers. Your CSS, if any, will probably also be harder to write, and it may be hard to test how well it actually works on old browsers. Some modern accessibility features, such as adjusting to screen sizes, may be (much) harder to get. If you want a multi-column layout or a sidebar, you're going to be back in the era of table based layouts (which this blog has never left, mostly because I'm lazy). And old browsers also mean old fonts, although with fonts it may be easier to degrade gracefully down to whatever default fonts the browser has.

(If you use images, there's the issue of image sizes and image formats. Old browsers are generally used on low resolution screens and aren't going to be the fastest or the best at scaling images down, if you can get them to do it as well. And you need to stick to image formats that they support.)

It's probably not impossible to do all of this, and you can test some of it by seeing how your site looks in text mode browsers like Lynx (which only really supports HTTP/1.0, as it turns out). But's certainly constraining; you have to really care, and it will cut you off from some things that are important and useful.

PS: I'm assuming that if you intend to be as fully usable as possible by old browsers, you're not even going to try to have JavaScript on your site.

(4 comments.)

The question of whether to still allow HTTP/1.0 requests or block them

Chris's Wiki :: blog

By: cks

29 October 2024 at 02:28

Recently, I discovered something and noted it on the Fediverse:

There are still a small number of things making HTTP/1.0 requests to my techblog. Many of them claim to be 'Chrome/124.<something>'. You know, I don't think I believe you, and I'm not sure my techblog should still accept HTTP/1.0 requests if all or almost all of them are malicious and/or forged.

The pure, standards-compliant answer to this is that of course you should still allow HTTP/1.0 requests. It remains a valid standard, and apparently some things may still default to it, and one part of the web's strength is its backward compatibility.

The pragmatic answer starts with the observation that HTTP/1.1 is now 25 years old, and any software that is talking HTTPS to you is demonstrably able to deal with standards that are more recent than that (generally much more recent, as sites require TLS 1.2 or better). And as a practical matter, pure HTTP/1.0 clients can't talk to many websites because such websites are name-based virtual hosts where the web server software absolutely requires a HTTP Host header before it will serve the website to you. If you leave out the Host header, at best you will get some random default site, perhaps a stub site.

(In a HTTPS context, web servers will also require TLS SNI and some will give you errors if the HTTP Host doesn't match the TLS SNI or is missing entirely. These days this causes HTTP/0.9 requests to be not very useful.)

If HTTP/1.0 requests were merely somewhere between a partial lie (in that everything that worked was actually supplying a Host header too) and useless (for things that didn't supply a Host), you could simply leave them be, especially if the volume was low. But my examination suggests strongly that approximately everything that is making HTTP/1.0 requests to Wandering Thoughts is actually up to no good; at a minimum they're some form of badly coded stealth spiders, quite possibly from would-be comment spammers that are trawling for targets. On a spot check, this seems to be true of another web server as well.

(A lot of the IPs making HTTP/1.0 requests provide claimed User-Agent headers that include ' Not-A.Brand/99 ', which appears to have been a Chrome experiment in putting random stuff in the User-Agent header. I don't see that in modern real Chrome user-agent strings, so I believe it's been dropped or de-activated since then.)

My own answer is that for now at least, I've blocked HTTP/1.0 requests to Wandering Thoughts. I'm monitoring what User-Agents get blocked, partly so I can perhaps exempt some if I need to, and it's possible I'll rethink the block entirely.

(Before you do this, you should certainly look at your own logs. I wouldn't expect there to be very many real HTTP/1.0 clients still out there, but the web has surprised me before.)

(One comment.)

The importance of name-based virtual hosts (websites)

Chris's Wiki :: blog

By: cks

27 October 2024 at 03:25

I recently read Geoff Huston's The IPv6 Transition, which is actually about why that transition isn't happening. A large reason for that is that we've found ways to cope with the shortage of IPv4 addresses, and one of the things Huston points to here is the introduction of the TLS Server Name Indicator (SNI) as drastically reducing the demand for IPv4 addresses for web servers. This is a nice story, but in actuality, TLS SNI was late to the party. The real hero (or villain) in taming what would otherwise have been a voracious demand for IPv4 addresses for websites is the HTTP Host header and the accompanying idea of name-based virtual hosts. TLS SNI only became important much later, when a mass movement to HTTPS hosts started to happen, partly due to various revelations about pervasive Internet surveillance.

In what is effectively the pre-history of the web, each website had to have its own IP(v4) address (an 'IP-based virtual host', or just your web server). If a single web server was going to support multiple websites, it needed a bunch of IP aliases, one per website. You can still do this today in web servers like Apache, but it has long since been superseded with name-based virtual hosts, which require the browser to send a Host: header with the other HTTP headers in the request. HTTP Host was officially added in HTTP/1.1, but I believe that back in the days basically everything accepted it even for HTTP 1.0 requests and various people patched it into otherwise HTTP/1.0 libraries and clients, possibly even before HTTP/1.1 was officially standardized.

(Since HTTP/1.1 dates from 1999 or so, all of this is ancient history by now.)

TLS SNI only came along much later. The Wikipedia timeline suggests the earliest you might have reasonably been able to use it was in 2009, and that would have required you to use a bleeding edge Apache; if you were using an Apache provided by your 'Long Term Support' Unix distribution, it would have taken years more. At the time that TLS SNI was introduced this was okay, because HTTPS (still) wasn't really seen as something that should be pervasive; instead, it was for occasional high-importance sites.

One result of this long delay for TLS SNI is that for years, you were forced to allocate extra IPv4 addresses and put extra IP aliases on your web servers in order to support multiple HTTPS websites, while you could support all of your plain-HTTP websites from a single IP. Naturally this served as a subtle extra disincentive to supporting HTTPS on what would otherwise be simple name-based virtual hosts; the only websites that it was really easy to support were ones that already had their own IPs (sometimes because they were on separate web servers, and sometimes for historical reasons if you'd been around long enough, as we had been).

(For years we had a mixed tangle of name-based and ip-based virtual hosts, and it was often difficult to recover the history of just why something was ip-based instead of name-based. We eventually managed to reform it down to only a few web servers and a few IP addresses, but it took a while. And even today we have a few virtual hosts that are deliberately ip-based for reasons.)

(2 comments.)

Increasing or Modifying Character Limits and Poll Options in Mastodon 4.3 and 4.4

IT+Notes

By: Stefano Marinelli

9 October 2024 at 12:53

Mastodon imposes a limit of 500 characters and 4 poll options. Here's how to change them.

Using a Permanent WebFinger Address for My Fediverse Profile

IT+Notes

By: Stefano Marinelli

8 October 2024 at 10:53

Using a Permanent WebFinger Address for My Fediverse Profile - Make it easy to find your profile from any instance.

A Small Compendium of Fediverse Platforms I Use

IT+Notes

By: Stefano Marinelli

12 September 2024 at 16:45

After revisiting my old Fediverse instances and helping friends set up new ones, I took the chance to update and evaluate several platforms. Here's my experience with Akkoma, GoToSocial, Mitra, Snac2, and Mastodon.

Make Your Own CDN with NetBSD

IT+Notes

By: Stefano Marinelli

2 September 2024 at 23:41

Learn how to build a self-hosted CDN using NetBSD, Varnish, and nginx

Make Your Own CDN with OpenBSD Base and Just 2 Packages

IT+Notes

By: Stefano Marinelli

28 August 2024 at 23:41

Learn how to build a self-hosted CDN using OpenBSD base and just 2 packages

Building a Self-Hosted CDN for BSD Cafe Media

IT+Notes

By: Stefano Marinelli

26 August 2024 at 06:41

Learn how to build a self-hosted CDN using FreeBSD jails, Nginx, Varnish, WireGuard, and PowerDNS to improve media delivery for your website without relying on external providers.

Syndication feed readers now seem to leave Last-Modified values alone

Chris's Wiki :: blog

By: cks

18 October 2024 at 03:08

A HTTP conditional GET is a way for web clients, such as syndication feed readers, to ask for a new copy of a URL only if the URL has changed since they last fetched it. This is obviously appealing for things, like syndication feed readers, that repeatedly poll URLs that mostly don't change, although syndication feed readers not infrequently get parts of this wrong. When a client makes a conditional GET, it can present an If-Modified-Since header, an If-None-Match header, or both. In theory, the client's If-None-Match value comes from the server's ETag, which is an opaque value, and the If-Modified-Since comes from the server's Last-Modified, which is officially a timestamp but which I maintain is hard to compare except literally.

I've long believed and said that many clients treat the If-Modified-Since header as a timestamp and so make up their own timestamp values; one historical example is Tiny Tiny RSS, and another is NextCloud-News. This belief led me to consider pragmatic handling of partial matches for HTTP conditional GET, and due to writing that entry, it also led me to actually instrument DWiki so I could see when syndication feed clients presented If-Modified-Since timestamps that were after my feed's Last-Modified. The result has surprised me. Out of the currently allowed feed fetchers, almost no syndication feed fetcher seems to present its own, later timestamp in requests, and on spot checks, most of them don't use too-old timestamps either.

(Even Tiny Tiny RSS may have changed its ways since I last looked at its behavior, although I'm keeping my special hack for it in place for now.)

Out of my reasonably well behaved, regular feed fetchers (other than Tiny Tiny RSS), only two uncommon ones regularly present timestamps after my Last-Modified value. And there are a lot of different User-Agents that managed to do a successful conditional GET of my syndication feed.

(There are, unfortunately, quite a lot of User-Agents that fetched my feed but didn't manage even a single successful conditional GET. But that's another matter, and some of them may have an extremely low polling interval. It would take me a lot more work to correlate this with which requests didn't even try any conditional GETs.)

This genuinely surprises me, and means I have to revise my belief that everyone mangles If-Modified-Since. Mostly they don't. As a corollary, parsing If-Modified-Since strings into timestamps and doing timestamp comparisons on them is probably not worth it, especially if Tiny Tiny RSS has genuinely changed.

(My preliminary data also suggests that almost no one has a different timestamp but a matching If-None-Match value, so my whole theory on pragmatic partial matches is irrelevant. As mentioned in an earlier entry, some feed readers get it wrong the other way around.)

PS: I believe that rachelbythebay's more systematic behavioral testing of feed readers has unearthed a variety of feed readers that have more varied If-Modified-Since behavior than I'm seeing; see eg this recent roundup. So actual results on your website may vary significantly depending on your readers and what they use.

Potential pragmatic handling of partial matches for HTTP conditional GET

Chris's Wiki :: blog

By: cks

12 October 2024 at 02:02

In HTTP, a conditional GET is a GET request that potentially can be replied with a HTTP '304 Not Modified' status; this is quite useful for polling relatively unchanging resources like syndication feeds (although syndication feed readers don't always do so well at it). Generally speaking, there are two potential validators for conditional GET requests; the If-None-Match header, validated against the ETag of the reply, and the If-Modified-Since header, validated against the Last-Modified of the reply. A HTTP client can remember and use either or both of your ETag and your Last-Modified values (assuming you provide both).

When a HTTP client sends both If-Modified-Since and If-None-Match, the fully correct, specifications compliant validation is to require both to match. This makes intuitive sense; both your ETag and your Last-Modified values are part of the state of what you're replying with, and if one doesn't match, the client has a different view of the URL's state than you do so you shouldn't claim it's 'not modified' from their state. Instead you should return the entire response so that they can update their view of your Last-Modified state.

In practice, two things potentially get in the way. First, it's common for syndication feed readers and other things to treat the 'If-Modified-Since' value they provide as a timestamp, not as an opaque string that echoes back your previous Last-Modified. Programs will put in what's probably some default time value, they'll use timestamps from internal events, and various other fun things. By contrast, your ETag value is opaque and has no meaning for programs to interpret, guess at, and make up; if a HTTP client sends an ETag, it's very likely to be one you provided (although this isn't certain). Second, it's not unusual for your ETag to be a much stronger validator than your Last-Modified; for example, your ETag may be a cryptographic hash of the contents and will definitely change if they do, while your Last-Modified is an imperfect approximation and may not change even if the content does.

In this situation, if a client presents an If-None-Match header that matches your current ETag and a If-Modified-Since that doesn't match your Last-Modified, it's extremely likely that they have your current content but have done one of the many things that make their 'timestamp' not match your Last-Modified. If you know you have a strong validator in your ETag and they're doing something like fetching your syndication feed (where it's very likely that they're going to do this a lot), it's pragmatically tempting to give them a HTTP 304 response even though you're technically not supposed to.

To reduce the temptation, you can change to comparing your Last-Modified value against people's If-Modified-Since as a timestamp (if you can parse their value that way), and giving people a HTTP 304 response if their timestamp is equal to or after yours. This is what I'd do today given how people actually handle If-Modified-Since, and it would work around many of the bad things that people do with If-Modified-Since (since usually they'll create timestamps that are more recent than your Last-Modified, although not always).

Despite everything I've written above, I don't know if this happens all that often. It's entirely possible that syndication feed readers and other programs that invent things for their If-Modified-Since values are also not using If-None-Match and ETag values. I've recently added instrumentation to the software here so that I can tell, so maybe I'll have more to report soon.

(If I was an energetic person I would hunt through the data that rachelbythebay has accumulated in their feed reader behavioral testing project to see what it has to say about this (the most recent update for which is here and I don't know of an overall index, see their archives). However, I'm not that energetic.)

(5 comments.)

Things syndication feed readers do with 'conditional GET'

Chris's Wiki :: blog

By: cks

8 October 2024 at 02:54

In HTTP, a conditional GET is a nice way of saving bandwidth (but not always work) when a web browser or other HTTP agent requests a URL that hasn't changed. Conditional GET is very useful for things that fetch syndication feeds (Atom or RSS), because they often try fetches much more often than the syndication feed actually changes. However, just because it would be a good thing if feed readers and other things did conditional GETs to fetch feeds doesn't mean that they actually do it. And when feed readers do try conditional GETs, they don't always do it right; for instance, Tiny Tiny RSS at least used to basically make up the 'If-Modified-Since' timestamps it sent (which I put in a hack for).

For reasons beyond the scope of this entry, I recently looked at my feed fetching logs for Wandering Thoughts. As usually happens when you turn over any rock involving web server logs, I discovered some multi-legged crawling things underneath, and in this case I was paying attention to what feed readers do (or don't do) for conditional GETs. Consider this a small catalog.

Some or perhaps all versions of NextCloud-News send an If-Modified-Since header with the value 'Wed, 01 Jan 1800 00:00:00 GMT'. This is always going to fail validation and turn into a regular GET request, whether you compare If-Modified-Since values literally or consider them as a timestamp and do timestamp comparisons. NextCloud-News might as well not bother sending an If-Modified-Since header at all.
A number of feed readers appear to only update their stored ETag value for your feed if your Last-Modified value also changes. In practice there are a variety of things that can change the ETag without changing the Last-Modified value, and some of them regularly happen here on Wandering Thoughts, which causes these feed readers to effectively decay into doing unconditional GET requests the moment, for example, someone leaves a new comment.
One feed reader sends If-Modified-Since values that use a numeric time offset, as in 'Mon, 07 Oct 2024 12:00:07 -0000'. This is also not a reformatted version of a timestamp I've ever given out, and is after the current Last-Modified value at the time the request was made. This client reliably attempts to pull my feed three times a day, at 02:00, 08:00, and 20:00, and the times of the If-Modified-Since values for those fetches are reliably 00:00, 06:00, and 12:00 respectively.
(I believe it may be this feed fetcher, but I'm not going to try to reverse engineer its If-Modified-Since generation.)
Another feed fetcher, possibly Firefox or an extension, made up its own timestamps that were set after the current Last-Modified of my feed at the time it made the request. It didn't send an If-None-Match header on its requests (ie, it didn't use the ETag I return). This is possibly similar to the Tiny Tiny RSS case, with the feed fetcher remembering the last time it fetched the feed and using that as the If-Modified-Since value when it makes another request.

All of this is what I turned over in a single day of looking at feed fetchers that got a lot of HTTP 200 results (as opposed to HTTP 304 results, which shows a conditional GET succeeding). Probably there are more fun things lurking out there.

(I'm happy to have people read my feeds and we're not short on bandwidth, so this is mostly me admiring the things under the rock rather than anything else. Although, some feed readers really need to slow down the frequency of their checks; my feed doesn't update every few minutes.)

(2 comments.)

My "time to full crawl" (vague) metric

Chris's Wiki :: blog

By: cks

18 September 2024 at 02:43

This entry, along with all of Wandering Thoughts (this blog) and in fact the entire wiki-thing it's part of is dynamically rendered from my wiki-text dialect to HTML. Well, in theory. In practice, one of the several layers of caching that make DWiki (this software) perform decently is a cache of the rendered HTML. Because DWiki is often running as an old fashioned Apache CGI, this rendering cache lives on disk.

(DWiki runs in a complicated way that can see it operating as a CGI under low load or as a daemon with a fast CGI frontend under higher load; this entry has more details.)

Since there are only so many things to render to HTML, this on disk cache has a maximum size that it stabilizes at; given enough time, everything gets visited and thus winds up in the disk cache of rendered HTML. The render disk cache lives in its own directory hierarchy, and so I can watch its size with a simple 'du -hs' command. Since I delete the entire cache every so often, this gives me an indicator that I can call either "time to full cache" or "time to full crawl". The time to full cache is how long it typically takes for the cache to reach maximum size, which is how long it takes for everything to be visited by something (or actually, used to render a URL that something visited).

I haven't attempted to systematically track this measure, but when I've looked it usually takes less than a week for the render cache to reach its stable 'full' size. The cache stores everything in separate files, so if I was an energetic person I could scan through the cache's directory tree, look at the file modification times, and generate some nice graphs of how fast the crawling goes (based on either the accumulated file sizes or the accumulated number of files, depending on what I was interested in).

(In theory I could do this from web server access logs. This would give me a somewhat different measure, since I'd be tracking what URLs had been accessed at least once instead of which bits of wikitext had been used in displaying URLs. At the same time, it might be a more interesting measure of how fast things are visited, and I do have a catalog of all page URLs here in the form of an automatically generated sitemap.)

PS: I doubt this is a single crawler visiting all of Wandering Thoughts in a week or so. Instead I expect it's the combination of the assorted crawlers (most of them undesirable), plus some amount of human traffic.

Apache's odd behavior for requests with a domain with a dot at the end

Chris's Wiki :: blog

By: cks

3 September 2024 at 03:16

When I wrote about the fun fact that domains can end in dots and how this affects URLs, I confidently said that Wandering Thoughts (this blog) reacted to being requested through 'utcc.utoronto.ca.' (with a dot at the end) by redirecting you to the canonical form, without the final dot. Then in comments, Alex reported that they got a Apache '400 Bad Request' response when they did it. From there, things got confusing (and are still confusing).

First, this response is coming from Apache, not DWiki (the code behind the blog). You can get the same '400 Bad Request' response from https://utcc.utoronto.ca./~cks/ (a static file handled only by this host's Apache). Second, you don't always get this response; what happens depends on what you're using to access the URL. Here's what I've noticed and tested so far:

In some tools you'll get a TLS certificate validation failure due to a name mismatch, presumably because 'utcc.utoronto.ca.' doesn't match 'utcc.utoronto.ca'. GNU Wget2 behaves this way.
(GNU Wget version 1.x doesn't seem to have this behavior; instead I think it may strip the final '.' off before doing much processing. My impression is that GNU Wget2 and 'GNU Wget (1.x)' are fairly different programs.)
on some Apache configurations, you'll get a TLS certificate validation error from everything, because Apache apparently doesn't think that that the 'dot at end' version of the host name matches any of its configured virtual host names, and so it falls back to a default TLS certificate that doesn't match what you asked for.
(This doesn't happen with this host's Apache configuration but it does happen on some other ones I tested with.)
against this host's Apache, at least lynx, curl, Safari on iOS (to my surprise), and manual testing all worked, with the request reaching DWiki and DWiki then generating a redirect to the canonical hostname. By a manual test, I mean making a TLS connection to port 443 with a tool of mine and issuing:
```
GET /~cks/space/blog/ HTTP/1.0
Host: utcc.utoronto.ca.
```
(And no other headers, although a random User-Agent doesn't seem to affect things.)
Firefox and I presume Chrome get the Apache '400 Bad Request' error (I don't use Chrome and I'm not going to start for this).

I've looked at the HTTP headers that Firefox's web developer tools says it's sending and they don't look particularly different or unusual. But something is getting Apache to decide this is a bad request.

(It's possible that some modern web security related headers are triggering this behavior in Apache, and only a few major browsers are sending them. I am a little bit surprised that Safari on iOS doesn't trigger this.)

(2 comments.)

The web fun fact that domains can end in dots and canonicalization failures

Chris's Wiki :: blog

By: cks

30 August 2024 at 02:58

Recently, my section of the Fediverse learned that the paywall of a large US-based news company could be bypassed simply by putting a '.' at the end of the website name. That is to say, you asked for 'https://newssite.com./article' instead of 'https://newssite.com/article'. People had a bit of a laugh (myself included) and also sympathized, because this is relatively obscure DNS trivia. Later, I found myself with a bit of a different view, which is that this is a failure of canonicalization in the web programming and web server environment.

(One theory for how this issue could happen is that the news company runs multiple sites from the same infrastructure and wants the paywall to only apply to some of them. Modern paywalls are relatively sophisticated programming, so I can easily imagine listing off the domains that should be affected by the paywall and missing the 'domain.' forms, perhaps because the people doing the programming simply don't know that bit of trivial.)

At the textual level, there are a lot of ways to vary host names and URLs. Hostnames are case independent, characters in URLs can be %-encoded, and so on (and I'm going to leave out structural modifications like '/./' and '/../' URL path elements or adding random query parameters). Web programming and web server environments already shield people from some of those by default; for example, if you configure a name-based virtual host, I think basically every web server will treat the name you provided as a case-independent one. Broadly we can consider this as canonicalizing the URL and other HTTP request information for you, so that you don't have to do it and thus you don't have to know all of the different variations that are possible.

It's my view that this canonicalization should also happen for host and domain names with dots at the end. Your web programming code should not have to even care about the possibility by default, any more than you probably have to care about it when configuring virtual hosts. If you really wanted to know low-level details about the request you should be able to, but the normal, easily accessible information you use for comparing and matching and so on should be canonicalized for you. This way it can be handled once by experts who know all of the crazy things that can appear in URLs, instead of repeatedly by web programmers who don't.

(Because if we make everyone handle this themselves we already know what's going to happen; some of them won't, and then we'll get various sorts of malfunctions, bugs, and security issues.)

PS: I've probably written some web related code that gets this wrong, treating 'domain.' and 'domain' as two separate things (and so probably denying access to the 'domain.' form as an unknown host). In fact if you try this here on Wandering Thoughts, you'll get a redirection to the dotless version of the domain, but this is because I put in a general 'redirect all weird domain variations to the canonical domain' feature a long time ago.

(My personal view is that redirecting to the canonical form of the domain is a perfectly valid thing to do in this situation.)

(10 comments.)

Cool URLs Mean Something

Pixel Envy

By: Nick Heer

1 August 2024 at 03:55

Tim Berners-Lee in 1998:

Keeping URIs so that they will still be around in 2, 20 or 200 or even 2000 years is clearly not as simple as it sounds. However, all over the Web, webmasters are making decisions which will make it really difficult for themselves in the future. Often, this is because they are using tools whose task is seen as to present the best site in the moment, and no one has evaluated what will happen to the links when things change. The message here is, however, that many, many things can change and your URIs can and should stay the same. They only can if you think about how you design them.

Jay Hoffmann:

Links give greater meaning to our webpages. Without the link, we would lose this significant grammatical tool native the web. And as links die out and rot on the vine, what’s at stake is our ability to communicate in the proper language of hypertext.

A dead link may not seem like it means very much, even in the aggregate. But they are. One-way links, the way they exist on the web where anyone can link to anything, is what makes the web universal. In fact, the first name for URL’s was URI’s, or Universal Resource Identifier. It’s right there in the name. And as Berners-Lee once pointed out, “its universality is essential.”

In 2018, Google announced it was deprecating its URL shortener, with no new links being created after March 2019. All existing shortened links would, however, remain active. It announced this in a developer blog post which — no joke — returns a 404 error at its original URL, which I found via 9to5Google. Google could not bother to redirect posts from just six years ago to their new valid URLs.

Google’s URL shortener was in the news again this month because the company has confirmed it will turn off these links in August 2025 except for those created via Google’s own apps. Google Maps, for example, still creates a goo.gl short link when sharing a location.

In principle, I support this deprecation because it is confusing and dangerous for Google’s own shortened URLs to have the same domain as ones created by third-party users. But this is a Google-created problem because it designed its URLs poorly. It should have never been possible for anyone else to create links with the same URL shortener used by Google itself. Yet, while it feels appropriate for a Google service to be unreliable over a long term, it also should not be ending access to links which may have been created just about five years ago.

By the way, the Sophos link on the word “dangerous” in that last paragraph? I found it via a ZDNet article where the inline link is — you guessed it — broken. Sophos also could not bother to redirect this URL from 2018 to its current address. Six years ago! Link rot is a scourge.

⌥ Permalink

Third-Party Cookies Have Got to Go

Pixel Envy

By: Nick Heer

30 July 2024 at 02:26

Anthony Chavez, of Google:

[…] Instead of deprecating third-party cookies, we would introduce a new experience in Chrome that lets people make an informed choice that applies across their web browsing, and they’d be able to adjust that choice at any time. We’re discussing this new path with regulators, and will engage with the industry as we roll this out.

Oh good — more choices.

Hadley Beeman, of the W3C’s Technical Architecture Group:

Third-party cookies are not good for the web. They enable tracking, which involves following your activity across multiple websites. They can be helpful for use cases like login and single sign-on, or putting shopping choices into a cart — but they can also be used to invisibly track your browsing activity across sites for surveillance or ad-targeting purposes. This hidden personal data collection hurts everyone’s privacy.

All of this data collection only makes sense to advertisers in the aggregate, but it only works because of specifics: specific users, specific webpages, and specific actions. Privacy Sandbox is imperfect but Google could have moved privacy forward by ending third-party cookies in the world’s most popular browser.

⌥ Permalink

⌥ Engineering Consent

Pixel Envy

By: Nick Heer

30 July 2024 at 02:02

Anthony Ha, of TechCrunch, interviewed Jean-Paul Schmetz, CEO of Ghostery, and I will draw your attention to this exchange:

AH I want to talk about both of those categories, Big Tech and regulation. You mentioned that with GDPR, there was a fork where there’s a little bit of a decrease in tracking, and then it went up again. Is that because companies realized they can just make people say yes and consent to tracking?

J-PS What happened is that in the U.S., it continued to grow, and in Europe, it went down massively. But then the companies started to get these consent layers done. And as they figured it out, the tracking went back up. Is there more tracking in the U.S. than there is in Europe? For sure.

AH So it had an impact, but it didn’t necessarily change the trajectory?

J-PS It had an impact, but it’s not sufficient. Because these consent layers are basically meant to trick you into saying yes. And then once you say yes, they never ask again, whereas if you say no, they keep asking. But luckily, if you say yes, and you have Ghostery installed, well, it doesn’t matter, because we block it anyway. And then Big Tech has a huge advantage because they always get consent, right? If you cannot search for something in Google unless you click on the blue button, you’re going to give them access to all of your data, and you will need to rely on people like us to be able to clean that up.

The TechCrunch headline summarizes this by saying “regulation won’t save us from ad trackers”, but I do not think that is a fair representation of this argument. What it sounds like, to me, is that regulations should be designed more effectively.

The E.U.’s ePrivacy Directive and GDPR have produced some results: tracking is somewhat less pervasive, people have a right to data access and portability, and businesses must give users a choice. That last thing is, as Schmetz points out, also its flaw, and one it shares with something like App Tracking Transparency on iOS. Apps affected by the latter are not permitted to keep asking if tracking is denied, but they do similarly rely on the assumption a user can meaningfully consent to a cascading system of trackers.

In fact, the similarities and differences between cookie banner laws and App Tracking Transparency are considerable. Both require some form of consent mechanism immediately upon accessing a website or an app, assuming a user can provide that choice. Neither can promise tracking will not occur should a user deny the request. Both are interruptive.

But cookie consent laws typically offer users more information; many European websites, for example, enumerate all their third-party trackers, while App Tracking Transparency gives users no visibility into which trackers will be allowed. The latter choice is remembered forever unless a user removes and reinstalls the app, while websites can ask you for cookie consent on each visit. Perhaps the latter may sometimes be a consequence of using Safari; it is hard to know.

App Tracking Transparency also has a system-wide switch to opt out of all third-party tracking. There used to be something similar in web browsers, but compliance was entirely optional. Its successor effort, Global Privacy Control, is sadly not as widely supported as it ought to be, but it appears to have legal teeth.

Both of these systems have another important thing in common: neither are sufficiently protective of users’ privacy because they burden individuals with the responsibility of assessing something they cannot reasonably comprehend. It is patently ridiculous to put the responsibility on individuals to mitigate a systemic problem like invasive tracking schemes.

There should be a next step to regulations like these because user tracking is not limited to browsers where Ghostery can help — if you know about it. A technological response is frustrating and it is unclear to me how effective it is on its own. This is clearly not a problem only regulation can solve but neither can browser extensions. We need both.

Modern web PKI (TLS) is very different than it used to be

Chris's Wiki :: blog

By: cks

3 August 2024 at 02:45

In yesterday's entry on the problems OCSP Stapling always faced, I said that OCSP Stapling felt like something from an earlier era of the Internet. In a way, this is literally true. The OCSP Stapling RFC was issued in January 2011, so the actual design work is even older. In 2011, Let's Encrypt was a year away from being started and the Snowden leaks about pervasive Internet interception (and 'SSL added and removed here') had not yet happened. HTTPS was a relative luxury, primarily deployed for security sensitive websites such as things that you had to log in to (and even that wasn't universal). Almost all Certificate Authorities charged money (and the ones that had free certificates sometimes failed catastrophically), the shortest TLS certificate you could get generally lasted for a year, and there were probably several orders of magnitude fewer active TLS certificates than there are today.

(It was also a different world in that browsers were much more tolerant of Certificate Authority misbehavior, so much so that I could write that I couldn't think of a significant CA that had been de-listed by browsers.)

The current world of web PKI is a very different place from that. Let's Encrypt, the current biggest CA, currently has almost 380 million active TLS certificates, HTTPS is increasingly expected and required by people and browsers (in order to enable various useful new bits of Javascript and so on), and a large portion of web traffic is HTTPS instead of HTTP. For good reasons, it's become well understood that everything should be HTTPS if at all possible. Commercial Certificate Authorities (ones that charge money for TLS certificates) face increasingly hard business challenges, since Let's Encrypt is free, and even their volume is probably up. With HTTPS connections being dominant, everything related to that is now on the critical path to them working and being speedy, placing significant demands on things like OCSP infrastructure.

(These demands would be much, much worse if Chrome, the dominant browser, checked the OCSP status of certificates. We don't really have an idea of how many CAs could stand up to that volume and how much it would cost them.)

In the before world of 2011, being a Certificate Authority was basically a license to print money if you could manage some basic business and operations competence. In the modern world of 2024, being a general Certificate Authority is a steadily increasing money sink with a challenging business model.

OCSP Stapling always faced a bunch of hard problems

Chris's Wiki :: blog

By: cks

2 August 2024 at 03:41

One reaction to my entry about how the Online Certificate Status Protocol (OCSP) is basically dead is to ask why OCSP Stapling was abandoned along with OCSP and why it didn't catch on. The answer, which will not please people who liked OCSP Stapling, is that OCSP Stapling was always facing a bunch of hard problems and it's not really a surprise that it failed to overcome them.

If OCSP Stapling was to really deliver serious security improvements, it had to be mandatory in the long run. Otherwise someone who had a stolen and revoked certificate could just use the certificate without any stapling and have you fall back to trusting it. The OCSP standard provided a way to do this, in the form of the 'OCSP Must Staple' option that you or your Certificate Authority could set in the signed TLS certificate. The original plan with OCSP Stapling was that it would just be an optimization to basic OCSP, but since basic OCSP turned out to be a bad idea and is now dead, OCSP Stapling must stand on its own. As a standalone thing, I believe that OCSP Stapling has to eventually require stapling, with CAs normally or always providing TLS certificates that set the 'must staple' option.

Getting a web server to do OCSP Stapling requires both software changes and operational changes. The basic TLS software has to provide stapled OCSP responses, getting them from somewhere, and then there has to be something that fetches signed OCSP responses from the CA periodically and stores them so that the TLS software could use them. There are a lot of potential operational changes here, because your web server may go from a static frozen thing that does not need to contact things in the outside world or store local state to something that needs to do both. Alternately, maybe you need to build an external system to fetch OCSP responses and inject them into the static web server environment, in much the same way that you periodically have to inject new TLS certificates.

(You could also try to handle this at the level of TLS libraries, but things rapidly get challenging and many people will be unhappy if their TLS library starts creating background threads that call out to Certificate Authority sites.)

There's a lot of web server software out there, with its development moving at different speeds, plus people then have to get around to deploying the new versions, which may literally take a decade or so. There are also a whole lot of people operating web servers, in a widely varied assortment of environments and with widely varied level of both technical skill and available time to change how they operate. And in order to get people to do all of this work, you have to persuade them that it's worth it, which was not helped by early OCSP stapling software having various operational issues that could make enabling OCSP stapling worse than not doing so.

(Some of these environments are very challenging to operate in or change. For example, there are environments where what is doing TLS is an appliance which only offers you the ability to manually upload new TLS certificates, and which is completely isolated from the Internet by design. A typical example is server management processors and server BMC networks. Organizations with such environments were simply not going to accept TLS certificates that required a weekly, hands-on process (to load a new set of OCSP responses) or giving BMCs access to the Internet.)

All of this created a situation where OCSP Stapling never gathered a critical mass of adoption. Software for it was slow to appear and balky when it did appear, many people did not bother to set stapling up even when what they were using eventually supported it, and it was pretty clear to everyone that there was little benefit to setting up OCSP stapling (and it was dangerous if you told your CA to give you TLS certificates with OCSP Must Staple set).

Looking back, OCSP Stapling feels like something designed for an earlier Internet, one that was both rather smaller and much more agile about software and software deployment. In the (very) early Internet you really could roll out a change like this and have it work relatively well. But by the time OCSP Stapling was being specified, the Internet was lot like that any more.

PS: As noted in the comments on my entry on OCSP's death, another problem with OCSP Stapling is that if used pervasively, it effectively requires CAs to create and sign a large number of mini-certificates on a roughly weekly basis, in the form of (signed) OCSP responses. These signed responses aren't on the critical latency path of web browser requests, but they do have to be reliable. The less reliable CAs are about generating them, the sooner web servers will try to renew them (for extra safety margin if it takes several attempts), adding more load.

(2 comments.)