
Trapping Misbehaving Bots in an A.I. Labyrinth

By: Nick Heer
22 March 2025 at 04:32

Reid Tatoris, Harsh Saxena, and Luis Miglietti, of Cloudflare:

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

Two thoughts:

  1. This is amusing. Nothing funnier than using someone’s own words or, in this case, technology against them.

  2. This is surely going to lead to the same arms race as exists now between privacy protections and hostile adtech firms. Right?

⌥ Permalink

Doing multi-tag matching through URLs on the modern web

By: cks
14 March 2025 at 02:46

So what happened is that Mike Hoye had a question about a perfectly reasonable idea:

Question: is there wiki software out there that handles tags (date, word) with a reasonably graceful URL approach?

As in, site/wiki/2020/01 would give me all the pages tagged as 2020 and 01, site/wiki/foo/bar would give me a list of articles tagged foo and bar.

I got nerd-sniped by a side question but then, because I'd been nerd-sniped, I started thinking about the whole thing and it got more and more hair-raising as a thing done in practice.

This isn't because the idea of stacking selections like this is bad; 'site/wiki/foo/bar' is a perfectly reasonable and good way to express 'a list of articles tagged foo and bar'. Instead, it's because of how everything on the modern web eventually gets visited combined with how, in the natural state of this feature, 'site/wiki/bar/foo' is just as valid a URL for 'articles tagged both foo and bar'.

The combination, plus the increasing tendency of things on the modern web to rattle every available doorknob just to see what happens, means that even if you don't advertise 'bar/foo', sooner or later things are going to try it. And if you do make the combinations discoverable through HTML links, crawlers will find them very fast. At a minimum this means crawlers will see a lot of essentially duplicated content, and you'll have to go through all of the work to do the searches and generate the page listings and so on.

If I was going to implement something like this, I would define a canonical tag order and then, as early in request processing as possible, generate a HTTP redirect from any non-canonical ordering to the canonical one. I wouldn't bother checking whether the tags actually exist or anything; I'd just determine that they are tags, put them in canonical order, and if the request order wasn't canonical, redirect. That way at least all of your work (and all of the crawler attention) is directed at one canonical version. Smart crawlers will notice that this is a redirect to something they already have (and hopefully not re-request it), and you can more easily use caching.

(And if search engines still matter, the search engines will see only your canonical version.)
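
As a minimal sketch of this approach in Python (assuming, purely for illustration, a '/wiki/' URL prefix, simple word tags, and sorted order as the canonical order; this isn't anyone's actual code):

    # Sketch of canonical tag ordering with an early redirect.
    # Assumptions: tag URLs look like /wiki/<tag>/<tag>/..., and the
    # canonical order is plain sorted order.
    from urllib.parse import quote

    def canonical_tag_redirect(path):
        """Return a redirect target if the tag order isn't canonical, else None."""
        prefix = "/wiki/"
        if not path.startswith(prefix):
            return None
        tags = [t for t in path[len(prefix):].split("/") if t]
        canonical = sorted(tags)          # one fixed canonical order
        if tags == canonical:
            return None                   # already canonical, serve it normally
        return prefix + "/".join(quote(t) for t in canonical)

    # In a WSGI-ish handler you would then answer with a permanent redirect:
    #   target = canonical_tag_redirect(environ["PATH_INFO"])
    #   if target is not None:
    #       start_response("301 Moved Permanently", [("Location", target)])
    #       return [b""]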

This probably holds just as true for doing this sort of tag search through query parameters on GET queries; if you expose the result in a URL, you want to canonicalize it. However, GET query parameters are probably somewhat safer if you force people to form them manually and don't expose links to them. So far, web crawlers seem less likely to monkey around with query parameters than with URLs, based on my limited experience with the blog.

Some views on the common Apache modules for SAML or OIDC authentication

By: cks
12 March 2025 at 03:01

Suppose that you want to restrict access to parts of your Apache based website but you want something more sophisticated and modern than Apache Basic HTTP authentication. The traditional reason for this was to support 'single sign on' across all your (internal) websites; the modern reason is that a central authentication server is the easiest place to add full multi-factor authentication. The two dominant protocols for this are SAML and OIDC. There are commonly available Apache authentication modules for both protocols, in the form of Mellon (also) for SAML and OpenIDC for OIDC.

I've now used or at least tested the Ubuntu 24.04 version of both modules against the same SAML/OIDC identity provider, primarily because when you're setting up a SAML/OIDC IdP you need to be able to test it with something. Both modules work fine, but after my experiences I'm more likely to use OpenIDC than Mellon in most situations.

Mellon has two drawbacks and two potential advantages. The first drawback is that setting up a Mellon client ('SP') is more involved. Most of the annoying stuff is automated for you with the mellon_create_metadata script (which you can get from the Mellon repository if it's not in your Mellon package), but you still have to give your IdP your XML blob and get their XML blob. The other drawback is that Mellon isn't integrated into the Apache 'Require' framework for authorization decisions; instead you have to make do with Mellon-specific directives.

The first potential advantage is that Mellon has a straightforward story for protecting two different areas of your website with two different IdPs, if you need to do that for some reason; you can just configure them in separate <Location> or <Directory> blocks and everything works out. If anything, it's a bit non-obvious how to protect various disconnected bits of your URL space with the same IdP without having to configure multiple SPs, one for each protected section of URL space. The second potential advantage is that in general SAML has an easier story for your IdP giving you random information, and Mellon will happily export every SAML attribute it gets into the environment your CGI or web application gets.

The first advantage of OpenIDC is that it's straightforward to configure when you have a single IdP, with no XML and generally low complexity. It's also straightforward to protect multiple disconnected URL areas with the same IdP but possibly different access restrictions. A third advantage is that OpenIDC is integrated into Apache's 'Require' system, although you have to use OpenIDC specific syntax like 'Require claim groups:agroup' (see the OpenIDC wiki on authorization).

In exchange for this, it seems to be quite involved to use OpenIDC if you need to use multiple OIDC identity providers to protect different bits of your website. It's apparently possible to do this in the same virtual host but it seems quite complex and requires a lot of parts, so if I was confronted with this problem I would try very hard to confine each web thing that needed a different IdP into a different virtual host. And OpenIDC has the general OIDC problem that it's harder to expose random information.

(All of the important OpenIDC Apache directives about picking an IdP can't be put in <Location> or <Directory> blocks, only in a virtual host as a whole. If you care about this, see the wiki on Multiple Providers and also access to different URL paths on a per-provider basis.)

We're very likely to only ever be working with a single IdP, so for us OpenIDC is likely to be easier, although not hugely so.

Sidebar: The easy approach for group based access control with either

Both Mellon and OpenIDC work fine together with the traditional Apache AuthGroupFile directive, provided (of course) that you have or build an Apache format group file using what you've told Mellon or OpenIDC to use as the 'user' for Apache authentication. If your IdP is using the same user (and group) information as your regular system is, then you may well already have this information around.

(This is especially likely if you're migrating from Apache Basic HTTP authentication, where you already needed to build this sort of stuff.)

Building your own Apache group file has the additional benefit that you can augment and manipulate group information in ways that might not fit well into your IdP. Your IdP has the drawback that it has to be general; your generated Apache group file can be narrowly specific for the needs of a particular web area.
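
As an illustration of building such a file, here is a minimal Python sketch that writes an Apache group file ('group: user1 user2 ...') from local Unix group membership; the group names and output path are made up, and it only sees secondary group members.

    # Sketch: generate an Apache group file from local Unix groups.
    # The group list and the output path are hypothetical examples.
    import grp

    WANTED_GROUPS = ["webadmins", "staff"]      # hypothetical group names

    def apache_group_lines(groups):
        for gname in groups:
            try:
                members = grp.getgrnam(gname).gr_mem   # secondary members only
            except KeyError:
                continue                               # group doesn't exist here
            yield "%s: %s" % (gname, " ".join(sorted(members)))

    with open("/var/www/conf/webgroups", "w") as f:    # hypothetical path
        for line in apache_group_lines(WANTED_GROUPS):
            f.write(line + "\n")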

The web browser as an enabler of minority platforms

By: cks
11 March 2025 at 03:35

Recently, I got involved in a discussion on the Fediverse over what I will simplify to the desirability (or lack of it) of cross platform toolkits, including the browser, and how they erase platform personality and opinions. This caused me to have a realization about what web browser based applications are doing for me, which is that being browser based is what lets me use them at all.

My environment is pretty far from being a significant platform; I think Unix desktop share is in the low single percent under the best of circumstances. If people had to develop platform specific versions of things like Grafana (which is a great application), they'd probably exist for Windows, maybe macOS, and at the outside, tablets (some applications would definitely exist on phones, but Grafana is a bit of a stretch). They probably wouldn't exist on Linux, especially not for free.

That the web browser is a cross platform environment means that I get these applications (including the Fediverse itself) essentially 'for free' (which is to say, it's because of the efforts of web browsers to support my platform and then give me their work for free). Developers of web applications don't have to do anything to make them work for me, not even so far as making it possible to build their software on Linux; it just happens for them without them even having to think about it.

Although I don't work in the browser as much as some people do, looking back the existence of implicitly cross platform web applications has been a reasonably important thing in letting me stick with Linux.

This applies to any minority platform, not just Linux. All you need is a sufficiently capable browser and you have access to a huge range of (web) applications.

(Getting that sufficiently capable browser can be a challenge on a sufficiently minority platform, especially if you're not on a major architecture. I'm lucky in that x86 Linux is a majority minority platform; people on FreeBSD or people on architectures other than x86 and 64-bit ARM may be less happy with the situation.)

PS: I don't know if what we have used the web for really counts as 'applications', since they're mostly HTML form based things once you peel a few covers off. But if they do count, the web has been critical in letting us provide them to people. We definitely couldn't have built local application versions of them for all of the platforms that people here use.

(I'm sure this isn't a novel thought, but the realization struck (or re-struck) me recently so I'm writing it down.)

HTTP connections are part of the web's long tail

By: cks
22 February 2025 at 03:32

I recently read an article that, among other things, apparently seriously urges browser vendors to deprecate and disable plain text HTTP connections by the end of October of this year (via, and I'm deliberately not linking directly to the article). While I am a strong fan of HTTPS in general, I have some feelings about a rapid deprecation of HTTP. One of my views is that plain text HTTP is part of the web's long tail.

As I'm using the term here, the web's long tail (also) is the huge mass of less popular things that are individually less frequently visited but which in aggregate amount to a substantial part of the web. The web's popular, busy sites are frequently updated and can handle transitions without problems. They can readily switch to using modern HTML, modern CSS, modern JavaScript, and so on (although they don't necessarily do so), and along with that update all of their content to HTTPS. In fact they mostly or entirely have done so over the last ten to fifteen years. The web's long tail doesn't work like that. Parts of it use old JavaScript, old CSS, old HTML, and these days, plain HTTP (in addition to the people who have objections to HTTPS and deliberately stick to HTTP).

The aggregate size and value of the long tail is part of why browsers have maintained painstaking compatibility back to old HTML so far, including things like HTML Image Maps. There's plenty of parts of the long tail that will never be updated to have HTTPS or work properly with it. For browsers to discard HTTP anyway would be to discard that part of the long tail, which would be a striking break with browser tradition. I don't think this is very likely and I certainly hope that it never comes to pass, because that long tail is part of what gives the web its value.

(It would be an especially striking break since a visible percentage of page loads still happen with HTTP instead of HTTPS. For example, Google's stats say that globally 5% of Windows Chrome page loads apparently still use HTTP. That's roughly one in twenty page loads, and the absolute number is going to be very large given how many page loads happen with Chrome on Windows. This large number is one reason I don't think this is at all a serious proposal; as usual with this sort of thing, it ignores that social problems are the ones that matter.)

PS: Of course, not all of the HTTP connections are part of the web's long tail as such. Some of them are to, for example, manage local devices via little built in web servers that simply don't have HTTPS. The people with these devices aren't in any rush to replace them just because some people don't like HTTP, and the vendors who made them aren't going to update their software to support (modern) HTTPS even for the devices which support firmware updates and where the vendor is still in business.

(You can view them as part of the long tail of 'the web' as a broad idea and interface, even though they're not exposed to the world the way that the (public) web is.)

More potential problems for people with older browsers

By: cks
18 February 2025 at 03:40

I've written before that keeping your site accessible to very old browsers is non-trivial because of issues like them not necessarily supporting modern TLS. However, there's another problem that people with older browsers are likely to be facing, unless circumstances on the modern web change. I said on the Fediverse:

Today in unfortunate web browser developments: I think people using older versions of browsers, especially Chrome, are going to have increasing problems accessing websites. There are a lot of (bad) crawlers out there forging old Chrome versions, perhaps due to everyone accumulating AI training data, and I think websites are going to be less and less tolerant of them.

(Mine sure is currently, as an experiment.)

(By 'AI' I actually mean LLM.)

I covered some request volume information yesterday and it (and things I've seen today) strongly suggest that there is a lot of undercover scraping activity going on. Much of that scraping activity uses older browser User-Agents, often very old, which means that people who don't like it are probably increasingly going to put roadblocks in the way of anything presenting those old User-Agent values (there are already open source projects designed to frustrate LLM scraping and there will probably be more in the future).

(Apparently some LLM scrapers start out with honest User-Agents but then switch to faking them if you block their honest versions.)

There's no particular reason why scraping software can't use current User-Agent values, but it probably has to be updated every so often when new browser versions come out and people haven't done that so far. Much like email anti-spam efforts changing email spammer behavior, this may change if enough websites start reacting to old User-Agents, but I suspect that it will take a while for that to come to pass. Instead I expect it to be a smaller scale, distributed effort from 'unimportant' websites that are getting overwhelmed, like LWN (see the mention of this in their 'what we haven't added' section).

Major websites probably won't outright reject old browsers, but I suspect that they'll start throwing an increased amount of blocks in the way of 'suspicious' browser sessions with those User-Agents. This is likely to include CAPTCHAs and other such measures that they already use some of the time. CAPTCHAs aren't particularly effective at stopping bad actors in practice but they're the hammer that websites already have, so I'm sure they'll be used on this nail.
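
As a purely hypothetical illustration of the kind of check involved, here is a minimal sketch that flags claimed Chrome versions older than some cutoff; the cutoff number is invented, and real sites would presumably challenge rather than silently block.

    # Sketch: treat sufficiently old claimed Chrome versions as suspect.
    # The cutoff is a made-up example value.
    import re

    OLD_CHROME_CUTOFF = 120     # hypothetical: anything older looks suspicious

    def is_suspiciously_old_chrome(user_agent):
        m = re.search(r"Chrome/(\d+)\.", user_agent or "")
        if not m:
            return False                # not claiming to be Chrome at all
        return int(m.group(1)) < OLD_CHROME_CUTOFF

    # is_suspiciously_old_chrome("Mozilla/5.0 ... Chrome/100.0.0.0 Safari/537.36")
    # -> True, while a current Chrome version string passes.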

Another thing that I suspect will start happening is that more sites will start insisting that you run some JavaScript to pass a test in order to access them (whether this is an explicit CAPTCHA or just passive JavaScript that has to execute). This will stop LLM scrapers that don't run JavaScript, which is not all of them, and force the others to spend a certain amount of CPU and memory, driving up the aggregate cost of scraping your site dry. This will of course adversely affect people without JavaScript in their browser and those of us who choose to disable it for most sites, but that will be seen as the lesser evil by people who do this. As with anti-scraper efforts, there are already open source projects for this.

(This is especially likely to happen if LLM scrapers modernize their claimed User-Agent values to be exactly like current browser versions. People are going to find some defense.)

PS: I've belatedly made the Wandering Thoughts blocks for old browsers now redirect people to a page about the situation. I've also added a similar page for my current block of most HTTP/1.0 requests.

The HTTP status codes of responses from about 21 hours of traffic to here

By: cks
17 February 2025 at 04:06

You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting more strict about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests to here got over the past 21 hours and a bit. I think this is about typical, although there may be more blocked things than usual.

I'll start with the overall numbers for all requests:

 22792 403      [45%]
  9207 304      [18.3%]
  9055 200      [17.9%]
  8641 429      [17.1%]
   518 301
    58 400
    33 404
     2 206
     1 302

HTTP 403 is the error code that people get on blocked access; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked access is major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between HTTP 403s and HTTP 429s, 62% or so of the requests overall were rejected and only 36% got a useful reply.

(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)
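
(As an aside, a tally in roughly this format can be produced from an ordinary access log with a few lines of code. A minimal sketch, assuming an Apache/nginx 'combined' style log where the status code is the ninth whitespace-separated field; this isn't the script actually used here.)

    # Sketch: count HTTP status codes in a 'combined' format access log,
    # reading the log on standard input.
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 9 and fields[8].isdigit():
            counts[fields[8]] += 1

    total = sum(counts.values())
    for status, n in counts.most_common():
        print("%6d %s    [%.1f%%]" % (n, status, 100.0 * n / total))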

The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:

  9136 304    [39.5%]
  8641 429    [37.4%]
  3614 403    [15.6%]
  1663 200    [ 7.2%]
    19 301

Some of those rejections are for major web crawlers and almost a thousand are for a pair of prolific, repeat high volume request sources, but a lot of them aren't. Feed requests account for 23073 requests out of a total of 50307, or about 45% of the requests. To me this feels quite low for anything plausibly originated from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.

(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and out of them 44.4% were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. 46.9% made successful conditional GET requests (ones that got a HTTP 304 response) and 8.5% actually fetched feed data.)

DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:

  5499 403
   510 200
    39 301
     3 304

The most blocked alternate views are:

  1589 ?writecomment
  1336 ?normal
  1309 ?source
   917 ?showcomments

(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)

If I look only at plain requests, not requests for syndication feeds or alternate views, I see:

 13679 403   [64.5%]
  6882 200   [32.4%]
   460 301
    68 304
    58 400
    33 404
     2 206
     1 302

This means the breakdown of traffic is 21183 normal requests (42%), 45% feed requests, and the remainder for alternate views, almost all of which were rejected.

Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:

  7116 Forged Chrome/129.0.0.0 User-Agent
  1451 Bingbot
  1173 Forged Chrome/121.0.0.0 User-Agent
   930 PerplexityBot ('AI' LLM data crawler)
   915 Blocked sources using a 'Go-http-client/1.1' User-Agent

Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence, and it's possible that part of what is going on is that I have mis-firing blocking rules (alternately, I'm being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source). However, some of the ASNs that show up highly are definitely ones I recognize from other contexts, such as attempted comment spam.

Update: Well that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been so forged (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions that were temporarily unable to read the blog because of my overly-aggressive blocks.

Web application design and the question of what is a "route"

By: cks
8 February 2025 at 04:16

So what happened is that Leah Neukirchen ran a Fediverse poll on how many routes your most complex web app had, and I said that I wasn't going to try to count how many DWiki had and then gave an example of combining two things in a way that I felt was a 'route' (partly because 'I'm still optimizing the router' was one poll answer). This resulted in a discussion, and one of the questions I drew from it is "what is a route, exactly".

At one level counting up routes in your web application seems simple. For instance, in our Django application I could count up the URL patterns listed in our 'urlpatterns' setting (which gives me a larger number than I expected for what I think of as a simple Django application). Pattern delegation may make this a bit tedious, but it's entirely tractable. However, I think that this only works for certain sorts of web applications that are designed in a particular way, and as it happens I have an excellent example of where the concept of "route" gets fuzzy.

DWiki, the engine behind this blog, is actually a general filesystem based wiki (engine). As a filesystem based wiki, what it started out doing was to map any URL path to a filesystem object and then render the filesystem object in some appropriate way; for example, directories turn into a listing of their contents. With some hand-waving you could say that this is one route, or two once we throw in an optional system for handling static assets. Alternately you could argue that this is two (or three) routes, one route for directories and one route for files, because the two are rendered differently (although that's actually implemented in templates, not in code, so maybe they're one route after all).

Later I added virtual directories, which are added to the end of directory paths and are used to restrict what things are visible within the directory (or directory tree). Both the URL paths involved and the actual matching against them look like normal routing (although they're not handled through a traditional router approach), so I should probably count them as "routes", adding four or so more routes, so you could say that DWiki has somewhere between five and seven routes (if you count files and directories separately and throw in a third route for static asset files).

However, I've left out a significant detail, which is visible in how both the blog's front page and the Atom syndication feed of the blog use the same path in their URLs, and the blog's front page looks nothing like a regular directory listing. What's going on is that how DWiki presents both files and especially directories depends on the view they're shown in, and DWiki has a bunch of views; all of the above differences are because of different views being used. Standard blog entry files can be presented in (if I'm counting right) five different views. Directories have a whole menagerie of views that they support, including a 'blog' view. Because views are alternate presentations of a given filesystem object and thus URL path, they're provided as a query parameter, not as part of the URL's path.
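
To make the 'filesystem object plus view' shape concrete, here is a minimal sketch of that sort of dispatch. It is not DWiki's actual code; the view names and the stub renderers are purely illustrative.

    # Sketch: a single 'route' that maps any URL path to a filesystem
    # object and then renders it according to a ?view=... query parameter.
    import os.path

    # Stub renderers standing in for real template-driven rendering.
    def render_normal(fspath):  return "directory or file listing of %s" % fspath
    def render_blog(fspath):    return "blog-style view of %s" % fspath
    def render_atom(fspath):    return "Atom feed for %s" % fspath

    VIEWS = {"normal": render_normal, "blog": render_blog, "atom": render_atom}

    def handle(root, url_path, query):
        fspath = os.path.join(root, url_path.lstrip("/"))
        view = query.get("view", "normal")      # default presentation
        renderer = VIEWS.get(view)
        if renderer is None:
            return "404: no such view"
        return renderer(fspath)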

Are DWiki's views routes, and if they are, how do we count them? Is each unique combination of a page type (including virtual directories) and a view a new route? One thing that may affect your opinion of this is that a lot of the implementation of views is actually handled in DWiki's extremely baroque templates, not code. However, DWiki's code knows a full list of what views exist (and templates have to be provided or you'll get various failures).

(I've also left out a certain amount of complications, like redirections and invalid page names.)

The broad moral I draw from this exercise is that the model of distinct 'routes' is one that only works for certain sorts of web application design. When and where it works well, it's a quite useful model and I think it pushes you toward making good decisions about how to structure your URLs. But in any strong form, it's not a universal pattern and there are ways to go well outside it.

(Interested parties can see a somewhat out of date version of DWiki's code and many templates, although note that both contain horrors. At some point I'll probably update both to reflect my recent burst of hacking on DWiki.)

Web spiders (or people) can invent unfortunate URLs for your website

By: cks
3 February 2025 at 00:55

Let's start with my Fediverse post:

Today in "spiders on the Internet do crazy things": my techblog lets you ask for a range of entries. Normally the range that people ask for is, say, ten entries (the default, which is what you normally get links for). Some deranged spider out there decided to ask for a thousand entries at once and my blog engine sighed, rolled up its sleeves, and delivered (slowly and at large volume).

In related news, my blog engine can now restrict how large a range people can ask for (although it's a hack).

DWiki is the general wiki engine that creates Wandering Thoughts. As part of its generality, it has a feature that shows a range of 'pages' (in Wandering Thoughts these are entries, in general these are files in a directory tree), through what I call virtual directories. As is usual with these things, the range of entries (pages, files) that you're asking for is specified in the URL, with syntax like '<whatever>/range/20-30'.

If you visit the blog front page or similar things, the obvious and discoverable range links you get are for ten entries. You can under some situations get links for slightly bigger ranges, but not substantially larger ones. However, the engine didn't particularly restrict the size of these ranges, so if you wanted to create URLs by hand you could ask for very large ranges.

Today, I discovered that two IPs had asked for 1000-entry ranges today, and the blog engine provided them. Based on some additional log information, it looks like it's not the first time that giant ranges have been requested. One of those IPs was an AWS IP, for which my default assumption is that this is a web spider of some sort. Even if it's not a conventional web spider, I doubt anyone is asking for a thousand entries at once with the plan of reading them all; that's a huge amount of text, so it's most likely being done to harvest a lot of my entries at once for some purpose.

(Partly because of that and partly because it puts a big load on DWiki, I've now hacked in the feature mentioned above to restrict how large a range you can request. Because it's a hack, too-large ranges get HTTP 404 responses instead of something more useful.)
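
For illustration, here is a minimal sketch of parsing and limiting such a range, following the '<whatever>/range/20-30' syntax; the maximum size is a made-up number and this isn't DWiki's actual code.

    # Sketch: parse a '.../range/N-M' URL component and reject ranges
    # that are invalid or too large. MAXRANGE is a made-up limit.
    import re

    MAXRANGE = 100

    def parse_range(component):
        """Return (start, end) or None if the range is invalid or too large."""
        m = re.fullmatch(r"(\d+)-(\d+)", component)
        if not m:
            return None
        start, end = int(m.group(1)), int(m.group(2))
        if start < 1 or end < start or (end - start + 1) > MAXRANGE:
            return None         # the caller turns this into a 404 (or other 4xx)
        return (start, end)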

Sidebar: on the "virtual directories" name and feature

All of DWiki's blog parts are alternate views of a directory hierarchy full of files, where each file is a 'page' and in the context of Wandering Thoughts, almost all pages are blog entries (on the web, the 'See as Normal' link at the bottom will show you the actual directory view of something). A 'virtual directory' is a virtual version of the underlying real directory or directory hierarchy that only shows some pages, for example pages from 2025 or a range of pages based on how recent they are.

All of this is a collection of hacks built on top of other hacks, because that's what happens when you start with a file based wiki engine and decide you can make it be a blog too with only a few little extra features (as a spoiler, it did not wind up requiring only a few extra things). For example, you might wonder how the blog's front page winds up being viewed as a chronological blog, instead of a directory, and the answer is a hack.

Some learning experiences with HTTP cookies in practice

By: cks
27 January 2025 at 03:29

Suppose, not hypothetically, that you have a dynamic web site that makes minor use of HTTP cookies in a way that varies the output, and also this site has a caching layer. Naturally you need your caching layer to only serve 'standard' requests from cache, not requests that should get something non-standard. One obvious and simple approach is to skip your cache layer for any request that has a HTTP cookie. If you (I) do this, I have bad news about HTTP requests in practice, at least for syndication feed fetchers.

(One thing you might do with HTTP cookies is deliberately bypass your own cache, for example to ensure that someone who posts a new comment can immediately see their own comment, even if an older version of the page is in the cache.)

The thing about HTTP cookies is that the HTTP client can send you anything it likes as a HTTP cookie and unfortunately some clients will. For example, one feed reader fetcher deliberately attempts to bypass Varnish caches by sending a cookie with all fetch requests, so if the presence of any HTTP cookie causes you to skip your own cache (and other things you do that use the same logic), well, feeder.co is bypassing your caching layer too. Another thing that happens is that some syndication feed fetching clients appear to sometimes leak unrelated cookies into their HTTP requests.

(And of course if your software is hosted along side other software that might set unrestricted cookies for the entire website, those cookies may leak into requests made to your software. For feed fetching specifically, this is probably most likely in feed readers that are browser addons.)

The other little gotcha is that you shouldn't rely on merely the presence or absence of a 'Cookie:' header in the request to tell you if the request has cookies, because a certain number of HTTP clients appear to send a blank Cookie: header (ie, just 'Cookie:'). You might be doing this directly in a CGI by checking for the presence of $HTTP_COOKIE, or you might be doing this indirectly by parsing any Cookie: header in the request into a 'Cookies' object of some sort (even if the value is blank), in which case you'll wind up with an empty Cookies object.

(You can also receive cookies with a blank value in a Cookie: header, eg 'JSESSIONID=', which appears to be a deliberate decision by the software involved, and seems to be to deal with a bad feed source.)

If you actually care about all of this, as I do now that I've discovered it all, you'll want to specifically check for the presence of your own cookies and ignore any other cookies you see, as well as a blank 'Cookie:' HTTP header. Doing extra special things if you see a 'bypass_varnish=1' cookie is up to you.
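
Here is a minimal sketch of that 'check only for your own cookies' approach in a WSGI/CGI style environment; the cookie name is a made-up example.

    # Sketch: only react to cookies you actually set, ignoring blank
    # 'Cookie:' headers and unrelated cookies that clients send you.
    from http.cookies import CookieError, SimpleCookie

    MY_COOKIES = {"bypass_cache"}       # hypothetical cookie name

    def my_cookies(environ):
        """Return a dict of only the cookies we care about."""
        raw = environ.get("HTTP_COOKIE", "").strip()
        if not raw:                     # absent or blank 'Cookie:' header
            return {}
        jar = SimpleCookie()
        try:
            jar.load(raw)
        except CookieError:             # garbage in the header: ignore it all
            return {}
        return {k: jar[k].value for k in jar
                if k in MY_COOKIES and jar[k].value}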

(In theory I knew that the HTTP Cookie: header was untrusted client data and shouldn't be trusted, and sometimes even contained bad garbage (which got noted every so often in my logs). In practice I didn't think about the implications of that for some of my own code until now.)

Syndication feeds here are now rate-limited on a per-IP basis

By: cks
26 January 2025 at 03:30

For a long time I didn't look very much at the server traffic logs for Wandering Thoughts, including what was fetching my syndication feeds and how, partly because I knew that looking at web server logs invariably turns over a rock or two. In the past few months I started looking at my feed logs, and then I spent some time trying to get some high traffic sources to slow down on an ad-hoc basis, which didn't have much success (partly because browser feed reader addons seem bad at this). Today I finally gave in to temptation and added general per-IP rate limiting for feed requests. A single IP that requests a particular syndication feed too soon after its last successful request will receive a HTTP 429 response.

(The actual implementation is a hack, which is one reason I didn't do it before now; DWiki, the engine behind Wandering Thoughts, doesn't have an easy place for dynamically updated shared state.)
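
For illustration, here is a minimal in-memory sketch of the general idea; it is not the actual (stateless) hack used here, and the interval is a made-up number.

    # Sketch of per-IP, per-feed rate limiting: a repeat request for the
    # same feed too soon after the last successful one gets a HTTP 429.
    import time

    MIN_INTERVAL = 20 * 60      # made-up: 20 minutes between successful fetches
    _last_ok = {}               # (ip, feed) -> time of the last successful fetch

    def feed_request_allowed(ip, feed, now=None):
        now = now if now is not None else time.time()
        key = (ip, feed)
        last = _last_ok.get(key)
        if last is not None and (now - last) < MIN_INTERVAL:
            return False        # caller responds with HTTP 429
        _last_ok[key] = now     # record this as the last successful fetch
        return True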

This rate-limiting will probably only moderately reduce the load on Wandering Thoughts, for various reasons, but it will make me happier. I'm also looking forward to having a better picture of what I consider 'actual traffic' to Wandering Thoughts, including actual User-Agent usage, without the distortions added by badly behaved browser addons (I'm pretty sure that my casual view of Firefox's popularity for visitors has been significantly distorted by syndication feed over-fetching).

In applying this rate limiting, I've deliberately decided not to exempt various feed reader providers like NewsBlur, Feedbin, Feedly, and so on. Hopefully all of these places will react properly to receiving periodic HTTP 429 responses and not, say, entirely give up fetching my feeds after a while because they're experiencing 'too many errors'. However, time will tell if this is correct (and if my HTTP 429 responses cause them to slow down their often quite frequent syndication feed requests).

In general I'm going to have to see how things develop, and that's a decent part of why I'm doing this at all. I'm genuinely curious how clients will change their behavior (if they do) and what will emerge, so I'm doing a little experiment (one that's nowhere as serious and careful as rachelbythebay's ongoing work).

PS: The actual rate limiting applies a much higher minimum interval for unconditional HTTP syndication feed requests than for conditional ones, for the usual reason that I feel repeated unconditional requests for syndication feeds is rather antisocial, and if a feed fetcher is going to be antisocial I'm not going to talk to it very often.

More features for web page generation systems doing URL remapping

By: cks
23 January 2025 at 04:08

A few years ago I wrote about how web page generation systems should support remapping external URLs (this includes systems that convert some form of wikitext to HTML). At the time I was mostly thinking about remapping single URLs and mentioned things like remapping prefixes (so you could remap an entire domain into web.archive.org) as something for a fancier version. Well, the world turns and things happen and I now think that such prefix remapping is essential; even if you don't start out with it, you're going to wind up with it in the longer term.

(To put it one way, the reality of modern life is that sometimes you no longer want to be associated with some places. And some day, my Fediverse presence may also move.)

In light of a couple of years of churn in my website landscape (after what was in hindsight a long period of stability), I now have revised views on the features I want in a (still theoretical) URL remapping system for Wandering Thoughts. The system I want should be able to remap individual URLs, entire prefixes, and perhaps regular expressions with full scale rewrites (or maybe some scheme with wildcard matching), although I don't currently have a use for full scale regular expression rewrites. As part of this, there needs to be some kind of priority or hierarchy between different remappings that can all potentially match the same URL, because there's definitely at least one case today where I want to remap 'asite/a/*' somewhere and all other 'asite/*' URLs to something else. While it's tempting to do something like 'most specific thing matches', working out what is most specific from a collection of different sorts of remapping rules seems a bit hard, so I'd probably just implement it as 'first match wins' and manage things by ordering matches in the configuration file.

('Most specific match wins' is a common feature in web application frameworks for various reasons, but I think it's harder to implement here, especially if I allow arbitrary regular expression matches.)
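
To make this concrete, here is a minimal first-match-wins sketch supporting exact, prefix, and regular expression rules; the rule format and the example source and target URLs are invented for illustration.

    # Sketch: first-match-wins URL remapping with exact, prefix, and
    # regexp rules, checked in configuration order.
    import re

    RULES = [
        ("exact",  "http://asite.example/a/special", "https://elsewhere.example/special"),
        ("prefix", "http://asite.example/a/",        "https://mirror.example/a/"),
        ("prefix", "http://asite.example/",          "https://web.archive.org/web/2020/http://asite.example/"),
        ("regexp", r"^http://old\.example/(\d+)/",   r"https://new.example/posts/\1/"),
    ]

    def remap(url):
        for kind, pattern, target in RULES:
            if kind == "exact" and url == pattern:
                return target
            if kind == "prefix" and url.startswith(pattern):
                return target + url[len(pattern):]
            if kind == "regexp":
                new, n = re.subn(pattern, target, url, count=1)
                if n:
                    return new
        return url              # no rule matched; leave the URL alone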

Obviously the remapping configuration file should support comments (every configuration system needs to). Less obviously, I'd support file inclusion or the now common pattern of a '<whatever>.d' directory for drop in files, so that remapping rules can be split up by things like the original domain rather than having to all be dumped into an ever-growing single configuration file.

(Since more and more links rot as time passes, we can pretty much guarantee that the number of our remappings is going to keep growing.)

Along with the remapping, I may want something (ie, a tiny web application) that dynamically generates some form of 'we don't know where you can find this now but here is what the URL used to be' page for any URL I feed it. The obvious general reason for this is that sometimes old domain names get taken over by malicious parties and the old content is nowhere to be found, not even on web.archive.org. In that case you don't want to keep a link to what's now a malicious site, but you also don't have any other valid target for your old link. You could rewrite the link to some invalid domain name and leave it to the person visiting you and following the link to work out what happened, but it's better to be friendly.

(This is where you want to be careful about XSS and other hazards of operating what is basically an open 'put text in and we generate a HTML page with it shown in some way' service.)

The programmable web browser was and is inevitable

By: cks
4 January 2025 at 03:40

In a comment on my entry on why the modern web is why web browsers can't have nice things, superkuh wrote in part:

In the past it was seen as crazy to open every executable file someone might send you over the internet (be it email, ftp, web, or whatever). But sometime in the 2010s it became not only acceptable, but standard practice to automatically run every executable sent to you by any random endpoint on the internet.

For 'every executable' you should read 'every piece of JavaScript', which is executable code that is run by your browser as a free and relatively unlimited service provided to every web page you visit. The dominant thing restraining the executables that web pages send you is the limited APIs that browsers provide, which is why they provide such limited APIs. This comment sparked a chain of thoughts that led to a thesis.

I believe that the programmable web browser was (and is) inevitable. I don't mean this just in the narrow sense that if it hadn't been JavaScript it would have been Flash or Java applets or Lua or WASM or some other relatively general purpose language that the browser wound up providing. Instead, I mean it in a broad and general sense, because 'programmability' of the browser is driven by a general and real problem.

For almost as long as the web has existed, people have wanted to create web pages that had relatively complex features and interactions. They had excellent reasons for this; they wanted drop-down or fold-out menus to save screen space so that they could maximize the amount of space given to important stuff instead of navigation, and they wanted to interactively validate form contents before submission for fast feedback to the people filling them in, and so on. At the same time, browser developers didn't want to (and couldn't) program every single specific complex feature that web page authors wanted, complete with bespoke HTML markup for it and so on. To enable as many of these complex features as possible with as little work on their part as possible, browser developers created primitives that could be assembled together to create more sophisticated features, interactions, layouts, and so on.

When you have a collection of primitives that people are expected to use to create their specific features, interactions, and so on, you have a programming language and a programming environment. It doesn't really matter if this programming language is entirely declarative (and isn't necessarily Turing complete), as in the case of CSS; people have to program the web browser to get what they want.

So my view is that we were always going to wind up with at least one programming language in our web browsers, because a programming language is the meeting point between what web page authors want to have and what browser developers want to provide. The only question was (and is) how good of a programming language (or languages) we were going to get. Or perhaps an additional question was whether the people designing the 'programming language' were going to realize that they were doing so, or if they were going to create one through an accretion of features.

(My view is that CSS absolutely is a programming language in this sense, in that you must design and 'program' it in order to achieve the effects you want, especially if you want sophisticated ones like drop down menus. Modern CSS has thankfully moved beyond the days when I called it an assembly language.)

(This elaborates on a Fediverse post.)

The modern web is why web browsers don't have "nice things" (platform APIs)

By: cks
2 January 2025 at 04:00

Every so often I read something that says or suggests that the big combined browser and platform vendors (Google, Apple, and to a lesser extent Microsoft) have deliberately limited their browser's access to platform APIs that would put "progressive web applications" on par with native applications. While I don't necessarily want to say that these vendors are without sin, in my view this vastly misses the core reason web browsers have limited and slow moving access to platform APIs. To put it simply, it's because of what the modern web has turned into, namely "a hive of scum and villainy" to sort of quote a famous movie.

Any API the browser exposes to web pages is guaranteed to be used by bad actors, and this has been true for a long time. Bad actors will use these APIs to track people, to (try to) compromise their systems, to spy on them, or basically for anything that can make money or gain information. Many years ago I said this was why native applications weren't doomed and basically nothing has changed since then. In particular, browsers are no better at designing APIs that can't be abused or blocking web pages that abuse these APIs, and they probably never will be.

(One of the problems is the usual one in security; there are a lot more attackers than there are browser developers designing APIs, and the attackers only have to find one oversight or vulnerability. In effect attackers are endlessly ingenious while browser API designers have finite time they can spend if they want to ship anything.)

The result of this is that announcements of new browser APIs are greeted not with joy but with dread, because in practice they will mostly be yet another privacy exposure and threat vector (Chrome will often ship these APIs anyway because in practice as demonstrated by their actions, Google mostly doesn't care). Certainly there are some web sites and in-browser applications that will use them well, but generally they'll be vastly outnumbered by attackers that are exploiting these APIs. Browser vendors (even Google with Chrome) are well aware of these issues, which is part of why they create and ship so few APIs and often don't give them very much power.

(Even native APIs are increasingly restricted, especially on mobile devices, because there are similar issues on those. Every operating system vendor is more and more conscious of security issues and the exposures that are created for malicious applications.)

You might be tempted to say that the answer is forcing web pages to ask for permission to use these APIs. This is a terrible idea for at least two reasons. The first reason is alert (or question) fatigue; at a certain point this becomes overwhelming and people stop paying attention. The second reason is that people generally want to use websites that they're visiting, and if faced with a choice between denying a permission and being unable to use the website or granting the permission and being able to use the website, they will take the second choice a lot of the time.

(We can see both issues in effect in mobile applications, which have similar permissions requests and create similar permissions fatigue. And mobile applications ask for permissions far less often than web pages often would, because most people visit a lot more web pages than they install applications.)

Thinking about how to tame the interaction of conditional GET and caching

By: cks
21 November 2024 at 03:41

Due to how I do caching here, Wandering Thoughts has a long standing weird HTTP behavioral quirk where a non-conditional GET for a syndication feed here can get a different answer than a conditional GET. One (technical) way to explain this issue is that the cache validity interval for non-conditional GETs is longer than the cache validity interval for conditional GETs. In theory this could be the complete explanation of the issue, but in practice there's another part to it, which is that DWiki doesn't automatically insert responses into the cache on a cache miss.

(The cache is normally only filled for responses that were slow to generate, either due to load or because they're expensive. Otherwise I would rather dynamically generate the latest version of something and not clutter up cache space.)
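
To illustrate the two validity intervals, here is a minimal sketch of the lookup side; the intervals and the cache interface are invented for illustration and aren't DWiki's actual code.

    # Sketch: a cached feed is considered valid for longer when answering
    # an unconditional GET than when answering a conditional one.
    import time

    UNCOND_VALIDITY = 30 * 60   # made-up: 30 minutes for plain GETs
    COND_VALIDITY = 5 * 60      # made-up: 5 minutes for conditional GETs

    def cached_feed(cache, key, is_conditional, now=None):
        """Return the cached body if it's still valid for this request, else None."""
        now = now if now is not None else time.time()
        entry = cache.get(key)              # (body, stored_at) or None
        if entry is None:
            return None
        body, stored_at = entry
        limit = COND_VALIDITY if is_conditional else UNCOND_VALIDITY
        if (now - stored_at) <= limit:
            return body
        return None                         # too stale for this kind of request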

There are various paths that I could take, but which ones I want to take depends on what my goals are and I'm actually not entirely certain about that. If my goal is to serve responses to unconditional GETs that are as fresh as possible but come from cache for as long as possible, what I should probably do is make conditional GETs update the cache when the cached version of the feed exists and would still have been served to an unconditional GET. I've already paid the cost to dynamically generate the feed, so I might as well serve it to unconditional GET requests. However, in my current cache architecture this would have the side effect of causing conditional GETs to get that newly updated cached copy for the conditional GET cache validity period, instead of generating the very latest feed dynamically (what would happen today).

(A sleazy approach would be to backdate the newly updated cache entry by the conditional GET validity interval. My current code architecture doesn't allow for that, so I can avoid the temptation.)

On the other hand, the entire reason I have a different (and longer) cache validity interval for unconditional GET requests is that in some sense I want to punish them. It's a deliberate feature that unconditional GETs receive stale responses, and in some sense the more stale the response the better. Even though updating the cache with a current response I've already generated is in some sense free, doing it cuts against this goal, both in general and in specific. In practice, Wandering Thoughts sees frequent enough conditional GETs for syndication feeds that making conditional GETs refresh the cached feed would effectively collapse the two cache validity intervals into one, which I can already do without any code changes. So if this is my main goal for cache handling of unconditional GETs of my syndication feed, the current state is probably fine and there's nothing to fix.

(A very approximate number is that about 15% of the syndication feed requests to Wandering Thoughts are unconditional GETs. Some of the offenders should definitely know and do better, such as 'Slackbot 1.0'.)

Syndication feed fetchers and their behavior on HTTP 429 status responses

By: cks
11 November 2024 at 04:09

For reasons outside of the scope of this entry, recently I've been looking at the behavior of syndication feed fetchers here on Wandering Thoughts (which are generally from syndication feed readers), and in the process I discovered some that were making repeated requests at a quite aggressive rate, such as every five minutes. Until recently there was some excuse for this, because I wasn't setting a 'Cache-Control: max-age=...' header (also), which is (theoretically) used to tell Atom feed fetchers how soon they should re-fetch. I feel there was not much of an excuse because no feed reader should default to fetching every five minutes, or even every fifteen, but after I set my max-age to an hour there definitely should be no excuse.

Since sometimes I get irritated with people like this, I arranged to start replying to such aggressive feed fetchers with a HTTP 429 "Too Many Requests" status response (the actual implementation is a hack because my entire software is more or less stateless, which makes true rate limiting hard). What I was hoping for is that most syndication feed fetching software would take this as a signal to slow down how often it tried to fetch the feed, and I'd see excessive sources move from one attempt every five minutes to (much) slower rates.

That basically didn't happen (perhaps this is no surprise). I'm sure there's good syndication feed fetching software that probably would behave that way on HTTP 429 responses, but whatever syndication feed software was poking me did not react that way. As far as I can tell from casually monitoring web access logs, almost no mis-behaving feed software paid any attention to the fact that it was specifically getting a response that normally means "you're doing this too fast". In some cases, it seems to have caused programs to try to fetch even more than before.

(Perhaps some of this is because I didn't add a 'Retry-After' header to my HTTP 429 responses until just now, but even without that, I'd expect clients to back off on their own, especially after they keep getting 429s when they retry.)
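
For concreteness, here is a minimal sketch of the two response headers involved in a WSGI-style handler; the durations are example values, not necessarily the ones used here.

    # Sketch: advertise a re-fetch interval on normal feed responses and
    # a retry delay on 429s. The durations are example values.

    def feed_ok_headers(max_age=3600):
        # "please don't re-fetch more than once an hour"
        return [("Content-Type", "application/atom+xml"),
                ("Cache-Control", "max-age=%d" % max_age)]

    def feed_429_headers(retry_after=1800):
        # "Too Many Requests"; Retry-After is in seconds
        return [("Retry-After", str(retry_after))]

    # In a WSGI app:
    #   start_response("429 Too Many Requests", feed_429_headers())
    # or, on a successful fetch:
    #   start_response("200 OK", feed_ok_headers())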

Given the HTTP User-Agents presented by feed fetchers, some of this is more or less expected, for two reasons. First, some of the User-Agents are almost certainly deliberate lies, and if a feed crawler is going to actively lie about what it is there's no reason for it to respect HTTP 429s either. Second, some of the feed fetching is being done by stateless programs like curl, where the people building ad-hoc feed fetching systems around them would have to go (well) out of their way to do the right thing. However, a bunch of the aggressive feed fetching is being done by either real feed fetching software with a real user-agent (such as "RSS Bot" or the Universal Feed Parser) or by what look like browser addons running in basically current versions of Firefox. I'd expect both of these to respect HTTP 429s if they're programmed decently. But then, if they were programmed decently they probably wouldn't be trying every five minutes in the first place.

(Hopefully the ongoing feed reader behavior project by rachelbythebay will fix some of this in the long run; there are encouraging signs, as covered in eg the October 25th score report.)

Keeping your site accessible to old browsers is non-trivial

By: cks
31 October 2024 at 03:13

One of the questions you could ask about whether or not to block HTTP/1.0 requests is what this does to old browsers and your site's accessibility to (or from) them (see eg the lobste.rs comments on my entry). The reason one might care about this is that old systems can usually only use old browsers, so to keep it possible to still use old systems you want to accommodate old browsers. Unfortunately the news there is not really great, and taking old browsers and old systems seriously has a lot of additional effects.

The first issue is that old systems generally can't handle modern TLS and don't recognize modern certificate authorities, like Let's Encrypt. This situation is only going to get worse over time, as websites increasingly require TLS 1.2 or better (and then in the future, TLS 1.3 or better). If you seriously care about keeping your site accessible to old browsers, you need to have a fully functional HTTP version. Increasingly, it seems that modern browsers won't like this, but so far they're willing to put up with it. I don't know if there's any good way to steer modern visitors to your HTTPS version instead of your HTTP version.

(This is one area where modern browsers preemptively trying HTTPS may help you.)

Next, old browsers obviously only support old versions of CSS, if they have very much CSS support at all (very old browsers probably won't). This can present a real conflict; you can have an increasingly basic site design that sticks within the bounds of what will render well on old browsers, or you can have one that looks good to what's probably the vast majority of your visitors and may or may not degrade gracefully on old browsers. Your CSS, if any, will probably also be harder to write, and it may be hard to test how well it actually works on old browsers. Some modern accessibility features, such as adjusting to screen sizes, may be (much) harder to get. If you want a multi-column layout or a sidebar, you're going to be back in the era of table based layouts (which this blog has never left, mostly because I'm lazy). And old browsers also mean old fonts, although with fonts it may be easier to degrade gracefully down to whatever default fonts the browser has.

(If you use images, there's the issue of image sizes and image formats. Old browsers are generally used on low resolution screens and aren't going to be the fastest or the best at scaling images down, if you can get them to do it at all. And you need to stick to image formats that they support.)

It's probably not impossible to do all of this, and you can test some of it by seeing how your site looks in text mode browsers like Lynx (which only really supports HTTP/1.0, as it turns out). But it's certainly constraining; you have to really care, and it will cut you off from some things that are important and useful.

PS: I'm assuming that if you intend to be as fully usable as possible by old browsers, you're not even going to try to have JavaScript on your site.

The question of whether to still allow HTTP/1.0 requests or block them

By: cks
29 October 2024 at 02:28

Recently, I discovered something and noted it on the Fediverse:

There are still a small number of things making HTTP/1.0 requests to my techblog. Many of them claim to be 'Chrome/124.<something>'. You know, I don't think I believe you, and I'm not sure my techblog should still accept HTTP/1.0 requests if all or almost all of them are malicious and/or forged.

The pure, standards-compliant answer to this is that of course you should still allow HTTP/1.0 requests. It remains a valid standard, and apparently some things may still default to it, and one part of the web's strength is its backward compatibility.

The pragmatic answer starts with the observation that HTTP/1.1 is now 25 years old, and any software that is talking HTTPS to you is demonstrably able to deal with standards that are more recent than that (generally much more recent, as sites require TLS 1.2 or better). And as a practical matter, pure HTTP/1.0 clients can't talk to many websites because such websites are name-based virtual hosts where the web server software absolutely requires a HTTP Host header before it will serve the website to you. If you leave out the Host header, at best you will get some random default site, perhaps a stub site.

(In a HTTPS context, web servers will also require TLS SNI and some will give you errors if the HTTP Host doesn't match the TLS SNI or is missing entirely. These days this causes Host-less HTTP/1.0 requests to be not very useful.)

If HTTP/1.0 requests were merely somewhere between a partial lie (in that everything that worked was actually supplying a Host header too) and useless (for things that didn't supply a Host), you could simply leave them be, especially if the volume was low. But my examination suggests strongly that approximately everything that is making HTTP/1.0 requests to Wandering Thoughts is actually up to no good; at a minimum they're some form of badly coded stealth spiders, quite possibly from would-be comment spammers that are trawling for targets. On a spot check, this seems to be true of another web server as well.

(A lot of the IPs making HTTP/1.0 requests provide claimed User-Agent headers that include ' Not-A.Brand/99 ', which appears to have been a Chrome experiment in putting random stuff in the User-Agent header. I don't see that in modern real Chrome user-agent strings, so I believe it's been dropped or de-activated since then.)

My own answer is that for now at least, I've blocked HTTP/1.0 requests to Wandering Thoughts. I'm monitoring what User-Agents get blocked, partly so I can perhaps exempt some if I need to, and it's possible I'll rethink the block entirely.

(Before you do this, you should certainly look at your own logs. I wouldn't expect there to be very many real HTTP/1.0 clients still out there, but the web has surprised me before.)
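
As an illustration of how little is involved if you want to do this sort of block inside a Python WSGI application (instead of in your web server's configuration), here is a minimal sketch; it is not DWiki's actual code, and the middleware name and messages are only illustrative.

  # Sketch: refuse HTTP/1.0 requests in a WSGI application, logging the
  # claimed User-Agent so real clients can be exempted later if needed.
  import sys

  def block_http10(app):
      def middleware(environ, start_response):
          if environ.get('SERVER_PROTOCOL') == 'HTTP/1.0':
              agent = environ.get('HTTP_USER_AGENT', '(no User-Agent)')
              print("blocked HTTP/1.0 from", environ.get('REMOTE_ADDR', '?'),
                    agent, file=sys.stderr)
              start_response('403 Forbidden',
                             [('Content-Type', 'text/plain; charset=utf-8')])
              return [b'This server no longer accepts HTTP/1.0 requests.\n']
          return app(environ, start_response)
      return middleware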

The importance of name-based virtual hosts (websites)

By: cks
27 October 2024 at 03:25

I recently read Geoff Huston's The IPv6 Transition, which is actually about why that transition isn't happening. A large reason for that is that we've found ways to cope with the shortage of IPv4 addresses, and one of the things Huston points to here is the introduction of the TLS Server Name Indication (SNI) as drastically reducing the demand for IPv4 addresses for web servers. This is a nice story, but in actuality, TLS SNI was late to the party. The real hero (or villain) in taming what would otherwise have been a voracious demand for IPv4 addresses for websites is the HTTP Host header and the accompanying idea of name-based virtual hosts. TLS SNI only became important much later, when a mass movement to HTTPS hosts started to happen, partly due to various revelations about pervasive Internet surveillance.

In what is effectively the pre-history of the web, each website had to have its own IP(v4) address (an 'IP-based virtual host', or just your web server). If a single web server was going to support multiple websites, it needed a bunch of IP aliases, one per website. You can still do this today in web servers like Apache, but it has long since been superseded by name-based virtual hosts, which require the browser to send a Host: header with the other HTTP headers in the request. HTTP Host was officially added in HTTP/1.1, but I believe that back in the day basically everything accepted it even for HTTP/1.0 requests and various people patched it into otherwise HTTP/1.0 libraries and clients, possibly even before HTTP/1.1 was officially standardized.

(Since HTTP/1.1 dates from 1999 or so, all of this is ancient history by now.)

TLS SNI only came along much later. The Wikipedia timeline suggests the earliest you might have reasonably been able to use it was in 2009, and that would have required you to use a bleeding edge Apache; if you were using an Apache provided by your 'Long Term Support' Unix distribution, it would have taken years more. At the time that TLS SNI was introduced this was okay, because HTTPS (still) wasn't really seen as something that should be pervasive; instead, it was for occasional high-importance sites.

One result of this long delay for TLS SNI is that for years, you were forced to allocate extra IPv4 addresses and put extra IP aliases on your web servers in order to support multiple HTTPS websites, while you could support all of your plain-HTTP websites from a single IP. Naturally this served as a subtle extra disincentive to supporting HTTPS on what would otherwise be simple name-based virtual hosts; the only websites that it was really easy to support were ones that already had their own IPs (sometimes because they were on separate web servers, and sometimes for historical reasons if you'd been around long enough, as we had been).

(For years we had a mixed tangle of name-based and ip-based virtual hosts, and it was often difficult to recover the history of just why something was ip-based instead of name-based. We eventually managed to reform it down to only a few web servers and a few IP addresses, but it took a while. And even today we have a few virtual hosts that are deliberately ip-based for reasons.)

A Small Compendium of Fediverse Platforms I Use

12 September 2024 at 16:45
After revisiting my old Fediverse instances and helping friends set up new ones, I took the chance to update and evaluate several platforms. Here’s my experience with Akkoma, GoToSocial, Mitra, Snac2, and Mastodon.

Syndication feed readers now seem to leave Last-Modified values alone

By: cks
18 October 2024 at 03:08

A HTTP conditional GET is a way for web clients, such as syndication feed readers, to ask for a new copy of a URL only if the URL has changed since they last fetched it. This is obviously appealing for things, like syndication feed readers, that repeatedly poll URLs that mostly don't change, although syndication feed readers not infrequently get parts of this wrong. When a client makes a conditional GET, it can present an If-Modified-Since header, an If-None-Match header, or both. In theory, the client's If-None-Match value comes from the server's ETag, which is an opaque value, and the If-Modified-Since comes from the server's Last-Modified, which is officially a timestamp but which I maintain is hard to compare except literally.

I've long believed and said that many clients treat the If-Modified-Since header as a timestamp and so make up their own timestamp values; one historical example is Tiny Tiny RSS, and another is NextCloud-News. This belief led me to consider pragmatic handling of partial matches for HTTP conditional GET, and due to writing that entry, it also led me to actually instrument DWiki so I could see when syndication feed clients presented If-Modified-Since timestamps that were after my feed's Last-Modified. The result has surprised me. Out of the currently allowed feed fetchers, almost no syndication feed fetcher seems to present its own, later timestamp in requests, and on spot checks, most of them don't use too-old timestamps either.

(Even Tiny Tiny RSS may have changed its ways since I last looked at its behavior, although I'm keeping my special hack for it in place for now.)

Out of my reasonably well behaved, regular feed fetchers (other than Tiny Tiny RSS), only two uncommon ones regularly present timestamps after my Last-Modified value. And there are a lot of different User-Agents that managed to do a successful conditional GET of my syndication feed.

(There are, unfortunately, quite a lot of User-Agents that fetched my feed but didn't manage even a single successful conditional GET. But that's another matter, and some of them may poll only very infrequently. It would take me a lot more work to correlate this with which requests didn't even try any conditional GETs.)

This genuinely surprises me, and means I have to revise my belief that everyone mangles If-Modified-Since. Mostly they don't. As a corollary, parsing If-Modified-Since strings into timestamps and doing timestamp comparisons on them is probably not worth it, especially if Tiny Tiny RSS has genuinely changed.

(My preliminary data also suggests that almost no one has a different timestamp but a matching If-None-Match value, so my whole theory on pragmatic partial matches is irrelevant. As mentioned in an earlier entry, some feed readers get it wrong the other way around.)

PS: I believe that rachelbythebay's more systematic behavioral testing of feed readers has unearthed a variety of feed readers that have more varied If-Modified-Since behavior than I'm seeing; see eg this recent roundup. So actual results on your website may vary significantly depending on your readers and what they use.

Potential pragmatic handling of partial matches for HTTP conditional GET

By: cks
12 October 2024 at 02:02

In HTTP, a conditional GET is a GET request that potentially can be replied with a HTTP '304 Not Modified' status; this is quite useful for polling relatively unchanging resources like syndication feeds (although syndication feed readers don't always do so well at it). Generally speaking, there are two potential validators for conditional GET requests; the If-None-Match header, validated against the ETag of the reply, and the If-Modified-Since header, validated against the Last-Modified of the reply. A HTTP client can remember and use either or both of your ETag and your Last-Modified values (assuming you provide both).

When a HTTP client sends both If-Modified-Since and If-None-Match, the fully correct, specifications compliant validation is to require both to match. This makes intuitive sense; both your ETag and your Last-Modified values are part of the state of what you're replying with, and if one doesn't match, the client has a different view of the URL's state than you do so you shouldn't claim it's 'not modified' from their state. Instead you should return the entire response so that they can update their view of your Last-Modified state.

In practice, two things potentially get in the way. First, it's common for syndication feed readers and other things to treat the 'If-Modified-Since' value they provide as a timestamp, not as an opaque string that echoes back your previous Last-Modified. Programs will put in what's probably some default time value, they'll use timestamps from internal events, and various other fun things. By contrast, your ETag value is opaque and has no meaning for programs to interpret, guess at, and make up; if a HTTP client sends an ETag, it's very likely to be one you provided (although this isn't certain). Second, it's not unusual for your ETag to be a much stronger validator than your Last-Modified; for example, your ETag may be a cryptographic hash of the contents and will definitely change if they do, while your Last-Modified is an imperfect approximation and may not change even if the content does.

In this situation, if a client presents an If-None-Match header that matches your current ETag and a If-Modified-Since that doesn't match your Last-Modified, it's extremely likely that they have your current content but have done one of the many things that make their 'timestamp' not match your Last-Modified. If you know you have a strong validator in your ETag and they're doing something like fetching your syndication feed (where it's very likely that they're going to do this a lot), it's pragmatically tempting to give them a HTTP 304 response even though you're technically not supposed to.

To reduce the temptation, you can change to comparing your Last-Modified value against people's If-Modified-Since as a timestamp (if you can parse their value that way), and giving people a HTTP 304 response if their timestamp is equal to or after yours. This is what I'd do today given how people actually handle If-Modified-Since, and it would work around many of the bad things that people do with If-Modified-Since (since usually they'll create timestamps that are more recent than your Last-Modified, although not always).
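
As a sketch of what this pragmatic validation might look like in code (illustrative only, with a deliberately simplified treatment of If-None-Match that ignores weak validators and '*'):

  # Sketch: decide whether a conditional GET can get a 304, preferring a
  # (strong) ETag match and otherwise comparing If-Modified-Since against
  # our Last-Modified as a timestamp.
  from email.utils import parsedate_to_datetime

  def can_send_304(etag, last_modified, if_none_match, if_modified_since):
      if if_none_match is not None:
          # Simplified: ignore weak validators and '*'.
          return etag in [v.strip() for v in if_none_match.split(',')]
      if if_modified_since is not None:
          try:
              theirs = parsedate_to_datetime(if_modified_since)
              ours = parsedate_to_datetime(last_modified)
              # 'At or after our Last-Modified' counts as not modified.
              return theirs >= ours
          except (TypeError, ValueError):
              # Unparseable dates or naive/aware mismatches: no match.
              return False
      return False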

Despite everything I've written above, I don't know if this happens all that often. It's entirely possible that syndication feed readers and other programs that invent things for their If-Modified-Since values are also not using If-None-Match and ETag values. I've recently added instrumentation to the software here so that I can tell, so maybe I'll have more to report soon.

(If I was an energetic person I would hunt through the data that rachelbythebay has accumulated in their feed reader behavioral testing project to see what it has to say about this (the most recent update for which is here and I don't know of an overall index, see their archives). However, I'm not that energetic.)

Things syndication feed readers do with 'conditional GET'

By: cks
8 October 2024 at 02:54

In HTTP, a conditional GET is a nice way of saving bandwidth (but not always work) when a web browser or other HTTP agent requests a URL that hasn't changed. Conditional GET is very useful for things that fetch syndication feeds (Atom or RSS), because they often try fetches much more often than the syndication feed actually changes. However, just because it would be a good thing if feed readers and other things did conditional GETs to fetch feeds doesn't mean that they actually do it. And when feed readers do try conditional GETs, they don't always do it right; for instance, Tiny Tiny RSS at least used to basically make up the 'If-Modified-Since' timestamps it sent (which I put in a hack for).

For reasons beyond the scope of this entry, I recently looked at my feed fetching logs for Wandering Thoughts. As usually happens when you turn over any rock involving web server logs, I discovered some multi-legged crawling things underneath, and in this case I was paying attention to what feed readers do (or don't do) for conditional GETs. Consider this a small catalog.

  • Some or perhaps all versions of NextCloud-News send an If-Modified-Since header with the value 'Wed, 01 Jan 1800 00:00:00 GMT'. This is always going to fail validation and turn into a regular GET request, whether you compare If-Modified-Since values literally or consider them as a timestamp and do timestamp comparisons. NextCloud-News might as well not bother sending an If-Modified-Since header at all.

  • A number of feed readers appear to only update their stored ETag value for your feed if your Last-Modified value also changes. In practice there are a variety of things that can change the ETag without changing the Last-Modified value, and some of them regularly happen here on Wandering Thoughts, which causes these feed readers to effectively decay into doing unconditional GET requests the moment, for example, someone leaves a new comment.

  • One feed reader sends If-Modified-Since values that use a numeric time offset, as in 'Mon, 07 Oct 2024 12:00:07 -0000'. This is also not a reformatted version of a timestamp I've ever given out, and is after the current Last-Modified value at the time the request was made. This client reliably attempts to pull my feed three times a day, at 02:00, 08:00, and 20:00, and the times of the If-Modified-Since values for those fetches are reliably 00:00, 06:00, and 12:00 respectively.

    (I believe it may be this feed fetcher, but I'm not going to try to reverse engineer its If-Modified-Since generation.)

  • Another feed fetcher, possibly Firefox or an extension, made up its own timestamps that were set after the current Last-Modified of my feed at the time it made the request. It didn't send an If-None-Match header on its requests (ie, it didn't use the ETag I return). This is possibly similar to the Tiny Tiny RSS case, with the feed fetcher remembering the last time it fetched the feed and using that as the If-Modified-Since value when it makes another request.

All of this is what I turned over in a single day of looking at feed fetchers that got a lot of HTTP 200 results (as opposed to HTTP 304 results, which shows a conditional GET succeeding). Probably there are more fun things lurking out there.

(I'm happy to have people read my feeds and we're not short on bandwidth, so this is mostly me admiring the things under the rock rather than anything else. Although, some feed readers really need to slow down the frequency of their checks; my feed doesn't update every few minutes.)

My "time to full crawl" (vague) metric

By: cks
18 September 2024 at 02:43

This entry, along with all of Wandering Thoughts (this blog) and in fact the entire wiki-thing it's part of is dynamically rendered from my wiki-text dialect to HTML. Well, in theory. In practice, one of the several layers of caching that make DWiki (this software) perform decently is a cache of the rendered HTML. Because DWiki is often running as an old fashioned Apache CGI, this rendering cache lives on disk.

(DWiki runs in a complicated way that can see it operating as a CGI under low load or as a daemon with a fast CGI frontend under higher load; this entry has more details.)

Since there are only so many things to render to HTML, this on disk cache has a maximum size that it stabilizes at; given enough time, everything gets visited and thus winds up in the disk cache of rendered HTML. The render disk cache lives in its own directory hierarchy, and so I can watch its size with a simple 'du -hs' command. Since I delete the entire cache every so often, this gives me an indicator that I can call either "time to full cache" or "time to full crawl". The time to full cache is how long it typically takes for the cache to reach maximum size, which is how long it takes for everything to be visited by something (or actually, used to render a URL that something visited).

I haven't attempted to systematically track this measure, but when I've looked it usually takes less than a week for the render cache to reach its stable 'full' size. The cache stores everything in separate files, so if I was an energetic person I could scan through the cache's directory tree, look at the file modification times, and generate some nice graphs of how fast the crawling goes (based on either the accumulated file sizes or the accumulated number of files, depending on what I was interested in).
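
(The scan itself would only be a few lines of Python. Here is a rough sketch, with the cache path as a stand-in, that prints cumulative file counts and byte totals per hour for feeding to a plotting tool.)

  # Sketch: walk the render cache and report how it filled up over time,
  # based on file modification times. The cache path is a placeholder.
  import collections
  import os
  import time

  CACHEDIR = "/path/to/dwiki/render-cache"

  files_by_hour = collections.Counter()
  bytes_by_hour = collections.Counter()
  for dirpath, dirnames, filenames in os.walk(CACHEDIR):
      for name in filenames:
          st = os.stat(os.path.join(dirpath, name))
          hour = time.strftime("%Y-%m-%d %H:00", time.localtime(st.st_mtime))
          files_by_hour[hour] += 1
          bytes_by_hour[hour] += st.st_size

  total_files = total_bytes = 0
  for hour in sorted(files_by_hour):
      total_files += files_by_hour[hour]
      total_bytes += bytes_by_hour[hour]
      print(hour, total_files, total_bytes)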

(In theory I could do this from web server access logs. This would give me a somewhat different measure, since I'd be tracking what URLs had been accessed at least once instead of which bits of wikitext had been used in displaying URLs. At the same time, it might be a more interesting measure of how fast things are visited, and I do have a catalog of all page URLs here in the form of an automatically generated sitemap.)

PS: I doubt this is a single crawler visiting all of Wandering Thoughts in a week or so. Instead I expect it's the combination of the assorted crawlers (most of them undesirable), plus some amount of human traffic.

Apache's odd behavior for requests with a domain with a dot at the end

By: cks
3 September 2024 at 03:16

When I wrote about the fun fact that domains can end in dots and how this affects URLs, I confidently said that Wandering Thoughts (this blog) reacted to being requested through 'utcc.utoronto.ca.' (with a dot at the end) by redirecting you to the canonical form, without the final dot. Then in comments, Alex reported that they got an Apache '400 Bad Request' response when they did it. From there, things got confusing (and are still confusing).

First, this response is coming from Apache, not DWiki (the code behind the blog). You can get the same '400 Bad Request' response from https://utcc.utoronto.ca./~cks/ (a static file handled only by this host's Apache). Second, you don't always get this response; what happens depends on what you're using to access the URL. Here's what I've noticed and tested so far:

  • In some tools you'll get a TLS certificate validation failure due to a name mismatch, presumably because 'utcc.utoronto.ca.' doesn't match 'utcc.utoronto.ca'. GNU Wget2 behaves this way.

    (GNU Wget version 1.x doesn't seem to have this behavior; instead I think it may strip the final '.' off before doing much processing. My impression is that GNU Wget2 and 'GNU Wget (1.x)' are fairly different programs.)

  • on some Apache configurations, you'll get a TLS certificate validation error from everything, because Apache apparently doesn't think that the 'dot at end' version of the host name matches any of its configured virtual host names, and so it falls back to a default TLS certificate that doesn't match what you asked for.

    (This doesn't happen with this host's Apache configuration but it does happen on some other ones I tested with.)

  • against this host's Apache, at least lynx, curl, Safari on iOS (to my surprise), and manual testing all worked, with the request reaching DWiki and DWiki then generating a redirect to the canonical hostname. By a manual test, I mean making a TLS connection to port 443 with a tool of mine and issuing:

    GET /~cks/space/blog/ HTTP/1.0
    Host: utcc.utoronto.ca.
    

    (And no other headers, although a random User-Agent doesn't seem to affect things.)

  • Firefox and I presume Chrome get the Apache '400 Bad Request' error (I don't use Chrome and I'm not going to start for this).

I've looked at the HTTP headers that Firefox's web developer tools says it's sending and they don't look particularly different or unusual. But something is getting Apache to decide this is a bad request.

(It's possible that some modern web security related headers are triggering this behavior in Apache, and only a few major browsers are sending them. I am a little bit surprised that Safari on iOS doesn't trigger this.)

The web fun fact that domains can end in dots and canonicalization failures

By: cks
30 August 2024 at 02:58

Recently, my section of the Fediverse learned that the paywall of a large US-based news company could be bypassed simply by putting a '.' at the end of the website name. That is to say, you asked for 'https://newssite.com./article' instead of 'https://newssite.com/article'. People had a bit of a laugh (myself included) and also sympathized, because this is relatively obscure DNS trivia. Later, I found myself with a bit of a different view, which is that this is a failure of canonicalization in the web programming and web server environment.

(One theory for how this issue could happen is that the news company runs multiple sites from the same infrastructure and wants the paywall to only apply to some of them. Modern paywalls are relatively sophisticated programming, so I can easily imagine listing off the domains that should be affected by the paywall and missing the 'domain.' forms, perhaps because the people doing the programming simply don't know that bit of trivia.)

At the textual level, there are a lot of ways to vary host names and URLs. Hostnames are case independent, characters in URLs can be %-encoded, and so on (and I'm going to leave out structural modifications like '/./' and '/../' URL path elements or adding random query parameters). Web programming and web server environments already shield people from some of those by default; for example, if you configure a name-based virtual host, I think basically every web server will treat the name you provided as a case-independent one. Broadly we can consider this as canonicalizing the URL and other HTTP request information for you, so that you don't have to do it and thus you don't have to know all of the different variations that are possible.

It's my view that this canonicalization should also happen for host and domain names with dots at the end. Your web programming code should not have to even care about the possibility by default, any more than you probably have to care about it when configuring virtual hosts. If you really wanted to know low-level details about the request you should be able to, but the normal, easily accessible information you use for comparing and matching and so on should be canonicalized for you. This way it can be handled once by experts who know all of the crazy things that can appear in URLs, instead of repeatedly by web programmers who don't.

(Because if we make everyone handle this themselves we already know what's going to happen; some of them won't, and then we'll get various sorts of malfunctions, bugs, and security issues.)
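
As an illustration of how small this is when it's done once in a central place, here is a sketch of host canonicalization as Python WSGI middleware; it's illustrative only and ignores IPv6 address literals and other corner cases.

  # Sketch: canonicalize the HTTP Host before application code sees it.
  def canonicalize_host(app):
      def middleware(environ, start_response):
          host = environ.get('HTTP_HOST', '')
          name, sep, port = host.partition(':')
          # Host names are case-independent, and a trailing dot names the
          # same thing as the dotless form.
          environ['HTTP_HOST'] = name.lower().rstrip('.') + sep + port
          return app(environ, start_response)
      return middleware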

PS: I've probably written some web related code that gets this wrong, treating 'domain.' and 'domain' as two separate things (and so probably denying access to the 'domain.' form as an unknown host). In fact if you try this here on Wandering Thoughts, you'll get a redirection to the dotless version of the domain, but this is because I put in a general 'redirect all weird domain variations to the canonical domain' feature a long time ago.

(My personal view is that redirecting to the canonical form of the domain is a perfectly valid thing to do in this situation.)

Cool URLs Mean Something

By: Nick Heer
1 August 2024 at 03:55

Tim Berners-Lee in 1998:

Keeping URIs so that they will still be around in 2, 20 or 200 or even 2000 years is clearly not as simple as it sounds. However, all over the Web, webmasters are making decisions which will make it really difficult for themselves in the future. Often, this is because they are using tools whose task is seen as to present the best site in the moment, and no one has evaluated what will happen to the links when things change. The message here is, however, that many, many things can change and your URIs can and should stay the same. They only can if you think about how you design them.

Jay Hoffmann:

Links give greater meaning to our webpages. Without the link, we would lose this significant grammatical tool native to the web. And as links die out and rot on the vine, what’s at stake is our ability to communicate in the proper language of hypertext.

A dead link may not seem like it means very much, even in the aggregate. But they are. One-way links, the way they exist on the web where anyone can link to anything, is what makes the web universal. In fact, the first name for URL’s was URI’s, or Universal Resource Identifier. It’s right there in the name. And as Berners-Lee once pointed out, “its universality is essential.”

In 2018, Google announced it was deprecating its URL shortener, with no new links being created after March 2019. All existing shortened links would, however, remain active. It announced this in a developer blog post which — no joke — returns a 404 error at its original URL, which I found via 9to5Google. Google could not bother to redirect posts from just six years ago to their new valid URLs.

Google’s URL shortener was in the news again this month because the company has confirmed it will turn off these links in August 2025 except for those created via Google’s own apps. Google Maps, for example, still creates a goo.gl short link when sharing a location.

In principle, I support this deprecation because it is confusing and dangerous for Google’s own shortened URLs to have the same domain as ones created by third-party users. But this is a Google-created problem because it designed its URLs poorly. It should have never been possible for anyone else to create links with the same URL shortener used by Google itself. Yet, while it feels appropriate for a Google service to be unreliable over a long term, it also should not be ending access to links which may have been created just about five years ago.

By the way, the Sophos link on the word “dangerous” in that last paragraph? I found it via a ZDNet article where the inline link is — you guessed it — broken. Sophos also could not bother to redirect this URL from 2018 to its current address. Six years ago! Link rot is a scourge.

⌥ Permalink

Third-Party Cookies Have Got to Go

By: Nick Heer
30 July 2024 at 02:26

Anthony Chavez, of Google:

[…] Instead of deprecating third-party cookies, we would introduce a new experience in Chrome that lets people make an informed choice that applies across their web browsing, and they’d be able to adjust that choice at any time. We’re discussing this new path with regulators, and will engage with the industry as we roll this out.

Oh good — more choices.

Hadley Beeman, of the W3C’s Technical Architecture Group:

Third-party cookies are not good for the web. They enable tracking, which involves following your activity across multiple websites. They can be helpful for use cases like login and single sign-on, or putting shopping choices into a cart — but they can also be used to invisibly track your browsing activity across sites for surveillance or ad-targeting purposes. This hidden personal data collection hurts everyone’s privacy.

All of this data collection only makes sense to advertisers in the aggregate, but it only works because of specifics: specific users, specific webpages, and specific actions. Privacy Sandbox is imperfect but Google could have moved privacy forward by ending third-party cookies in the world’s most popular browser.

⌥ Permalink

⌥ Engineering Consent

By: Nick Heer
30 July 2024 at 02:02

Anthony Ha, of TechCrunch, interviewed Jean-Paul Schmetz, CEO of Ghostery, and I will draw your attention to this exchange:

AH I want to talk about both of those categories, Big Tech and regulation. You mentioned that with GDPR, there was a fork where there’s a little bit of a decrease in tracking, and then it went up again. Is that because companies realized they can just make people say yes and consent to tracking?

J-PS What happened is that in the U.S., it continued to grow, and in Europe, it went down massively. But then the companies started to get these consent layers done. And as they figured it out, the tracking went back up. Is there more tracking in the U.S. than there is in Europe? For sure.

AH So it had an impact, but it didn’t necessarily change the trajectory?

J-PS It had an impact, but it’s not sufficient. Because these consent layers are basically meant to trick you into saying yes. And then once you say yes, they never ask again, whereas if you say no, they keep asking. But luckily, if you say yes, and you have Ghostery installed, well, it doesn’t matter, because we block it anyway. And then Big Tech has a huge advantage because they always get consent, right? If you cannot search for something in Google unless you click on the blue button, you’re going to give them access to all of your data, and you will need to rely on people like us to be able to clean that up.

The TechCrunch headline summarizes this by saying “regulation won’t save us from ad trackers”, but I do not think that is a fair representation of this argument. What it sounds like, to me, is that regulations should be designed more effectively.

The E.U.’s ePrivacy Directive and GDPR have produced some results: tracking is somewhat less pervasive, people have a right to data access and portability, and businesses must give users a choice. That last thing is, as Schmetz points out, also its flaw, and one it shares with something like App Tracking Transparency on iOS. Apps affected by the latter are not permitted to keep asking if tracking is denied, but they do similarly rely on the assumption a user can meaningfully consent to a cascading system of trackers.

In fact, the similarities and differences between cookie banner laws and App Tracking Transparency are considerable. Both require some form of consent mechanism immediately upon accessing a website or an app, assuming a user can provide that choice. Neither can promise tracking will not occur should a user deny the request. Both are interruptive.

But cookie consent laws typically offer users more information; many European websites, for example, enumerate all their third-party trackers, while App Tracking Transparency gives users no visibility into which trackers will be allowed. The latter choice is remembered forever unless a user removes and reinstalls the app, while websites can ask you for cookie consent on each visit. Perhaps the latter may sometimes be a consequence of using Safari; it is hard to know.

App Tracking Transparency also has a system-wide switch to opt out of all third-party tracking. There used to be something similar in web browsers, but compliance was entirely optional. Its successor effort, Global Privacy Control, is sadly not as widely supported as it ought to be, but it appears to have legal teeth.

Both of these systems have another important thing in common: neither are sufficiently protective of users’ privacy because they burden individuals with the responsibility of assessing something they cannot reasonably comprehend. It is patently ridiculous to put the responsibility on individuals to mitigate a systemic problem like invasive tracking schemes.

There should be a next step to regulations like these because user tracking is not limited to browsers where Ghostery can help — if you know about it. A technological response is frustrating and it is unclear to me how effective it is on its own. This is clearly not a problem only regulation can solve but neither can browser extensions. We need both.

Modern web PKI (TLS) is very different than it used to be

By: cks
3 August 2024 at 02:45

In yesterday's entry on the problems OCSP Stapling always faced, I said that OCSP Stapling felt like something from an earlier era of the Internet. In a way, this is literally true. The OCSP Stapling RFC was issued in January 2011, so the actual design work is even older. In 2011, Let's Encrypt was a year away from being started and the Snowden leaks about pervasive Internet interception (and 'SSL added and removed here') had not yet happened. HTTPS was a relative luxury, primarily deployed for security sensitive websites such as things that you had to log in to (and even that wasn't universal). Almost all Certificate Authorities charged money (and the ones that had free certificates sometimes failed catastrophically), the shortest TLS certificate you could get generally lasted for a year, and there were probably several orders of magnitude fewer active TLS certificates than there are today.

(It was also a different world in that browsers were much more tolerant of Certificate Authority misbehavior, so much so that I could write that I couldn't think of a significant CA that had been de-listed by browsers.)

The current world of web PKI is a very different place from that. Let's Encrypt, currently the biggest CA, has almost 380 million active TLS certificates, HTTPS is increasingly expected and required by people and browsers (in order to enable various useful new bits of Javascript and so on), and a large portion of web traffic is HTTPS instead of HTTP. For good reasons, it's become well understood that everything should be HTTPS if at all possible. Commercial Certificate Authorities (ones that charge money for TLS certificates) face increasingly hard business challenges, since Let's Encrypt is free, and even their volume is probably up. With HTTPS connections being dominant, everything related to that is now on the critical path to them working and being speedy, placing significant demands on things like OCSP infrastructure.

(These demands would be much, much worse if Chrome, the dominant browser, checked the OCSP status of certificates. We don't really have an idea of how many CAs could stand up to that volume and how much it would cost them.)

In the before world of 2011, being a Certificate Authority was basically a license to print money if you could manage some basic business and operations competence. In the modern world of 2024, being a general Certificate Authority is a steadily increasing money sink with a challenging business model.

OCSP Stapling always faced a bunch of hard problems

By: cks
2 August 2024 at 03:41

One reaction to my entry about how the Online Certificate Status Protocol (OCSP) is basically dead is to ask why OCSP Stapling was abandoned along with OCSP and why it didn't catch on. The answer, which will not please people who liked OCSP Stapling, is that OCSP Stapling was always facing a bunch of hard problems and it's not really a surprise that it failed to overcome them.

If OCSP Stapling was to really deliver serious security improvements, it had to be mandatory in the long run. Otherwise someone who had a stolen and revoked certificate could just use the certificate without any stapling and have you fall back to trusting it. The OCSP standard provided a way to do this, in the form of the 'OCSP Must Staple' option that you or your Certificate Authority could set in the signed TLS certificate. The original plan with OCSP Stapling was that it would just be an optimization to basic OCSP, but since basic OCSP turned out to be a bad idea and is now dead, OCSP Stapling must stand on its own. As a standalone thing, I believe that OCSP Stapling has to eventually require stapling, with CAs normally or always providing TLS certificates that set the 'must staple' option.

Getting a web server to do OCSP Stapling requires both software changes and operational changes. The basic TLS software has to provide stapled OCSP responses, getting them from somewhere, and then there has to be something that fetches signed OCSP responses from the CA periodically and stores them so that the TLS software could use them. There are a lot of potential operational changes here, because your web server may go from a static frozen thing that does not need to contact things in the outside world or store local state to something that needs to do both. Alternately, maybe you need to build an external system to fetch OCSP responses and inject them into the static web server environment, in much the same way that you periodically have to inject new TLS certificates.
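
(To make the operational side concrete, the 'something that fetches signed OCSP responses' is typically a small periodic job along these lines. This is only a sketch; the paths and OCSP URL are placeholders, and a real version would verify the response and its validity window before installing it.)

  # Sketch: periodically fetch a fresh OCSP response for stapling, using
  # the 'openssl ocsp' command. Verification is skipped here for brevity.
  import os
  import subprocess

  CERT = "/etc/ssl/certs/oursite.pem"           # the server certificate
  ISSUER = "/etc/ssl/certs/oursite-issuer.pem"  # the issuing CA certificate
  OCSP_URL = "http://ocsp.example-ca.invalid/"
  OUT = "/var/lib/tls/oursite.ocsp.der"         # where the web server reads it

  subprocess.run(
      ["openssl", "ocsp", "-issuer", ISSUER, "-cert", CERT,
       "-url", OCSP_URL, "-respout", OUT + ".new", "-noverify"],
      check=True)
  # Replace the old response atomically so the server never sees a partial one.
  os.replace(OUT + ".new", OUT)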

(You could also try to handle this at the level of TLS libraries, but things rapidly get challenging and many people will be unhappy if their TLS library starts creating background threads that call out to Certificate Authority sites.)

There's a lot of web server software out there, with its development moving at different speeds, plus people then have to get around to deploying the new versions, which may literally take a decade or so. There are also a whole lot of people operating web servers, in a widely varied assortment of environments and with widely varied level of both technical skill and available time to change how they operate. And in order to get people to do all of this work, you have to persuade them that it's worth it, which was not helped by early OCSP stapling software having various operational issues that could make enabling OCSP stapling worse than not doing so.

(Some of these environments are very challenging to operate in or change. For example, there are environments where what is doing TLS is an appliance which only offers you the ability to manually upload new TLS certificates, and which is completely isolated from the Internet by design. A typical example is server management processors and server BMC networks. Organizations with such environments were simply not going to accept TLS certificates that required a weekly, hands-on process (to load a new set of OCSP responses) or giving BMCs access to the Internet.)

All of this created a situation where OCSP Stapling never gathered a critical mass of adoption. Software for it was slow to appear and balky when it did appear, many people did not bother to set stapling up even when what they were using eventually supported it, and it was pretty clear to everyone that there was little benefit to setting up OCSP stapling (and it was dangerous if you told your CA to give you TLS certificates with OCSP Must Staple set).

Looking back, OCSP Stapling feels like something designed for an earlier Internet, one that was both rather smaller and much more agile about software and software deployment. In the (very) early Internet you really could roll out a change like this and have it work relatively well. But by the time OCSP Stapling was being specified, the Internet wasn't like that any more.

PS: As noted in the comments on my entry on OCSP's death, another problem with OCSP Stapling is that if used pervasively, it effectively requires CAs to create and sign a large number of mini-certificates on a roughly weekly basis, in the form of (signed) OCSP responses. These signed responses aren't on the critical latency path of web browser requests, but they do have to be reliable. The less reliable CAs are about generating them, the sooner web servers will try to renew them (for extra safety margin if it takes several attempts), adding more load.

The Online Certificate Status Protocol (OCSP) is basically dead now

By: cks
25 July 2024 at 03:16

The (web) TLS news of the time interval is that Let's Encrypt intends to stop doing OCSP more or less as soon as Microsoft will let them. Microsoft matters because they are apparently the last remaining major group that requires Certificate Authorities to support OCSP in order for the CA's TLS root certificates to be supported. This is functionally the death declaration for OCSP, including OCSP stapling.

(The major '(TLS) root programs' are all associated with browsers and major operating systems; Microsoft for Windows and Edge, Apple for macOS, iOS, and Safari, Google for Chrome and Android, and Mozilla for Firefox and basically everyone else.)

Let's Encrypt is only one TLS Certificate Authority so in theory other CAs could keep on providing OCSP. However, LE is the dominant TLS CA, responsible for issuing a very large number of the active TLS certificates, plus CAs don't like doing OCSP anyway because it takes a bunch of resources (since you have to be prepared for a lot of browsers and devices to ask you for the status of things). Also, as a practical matter OCSP has been mostly dead for a long time because Chrome hasn't supported OCSP for years, which means that only a small amount of traffic will be affected by the OCSP status of TLS certificates used for the web (which has periodically led to OCSP breaking and causing problems for people using browsers that do check, like Firefox; I've disabled OCSP in my Firefox profiles for years).

I suspect that Let's Encrypt's timeline of three to six months after Microsoft allows them to stop doing OCSP is better understood as 'one to two Let's Encrypt certificate rollovers', since all of LE's certificates are issued for 90 days. I also suspect that people will have enough problems with web servers (and perhaps client programs) that it will wind up being more toward the six month side.

Personally, I'm glad that OCSP is finally and definitely dying, and not just because I haven't had good experiences with it myself (as a Firefox user; as a website operator we never tried to add OCSP stapling). Regardless of its technical design, OCSP as an idea and a protocol is something that doesn't fit well into the modern Internet and how we understand the political issues involved with Internet-scale things (like how much they cost and who pays for them, what information they leak, what the consequences of an outage are, how much they require changes to slow-moving server software, and so on).

The Firefox source code's 'StaticPrefs' system (as of Firefox 128)

By: cks
15 July 2024 at 02:02

The news of the time interval is that Mozilla is selling out Firefox users once again (although Firefox remains far better than Chrome), in the form of 'Privacy-Preserving Attribution', which you might impolitely call 'browser managed tracking'. Mozilla enabled this by default in Firefox 128 (cf), and if you didn't know already you can read about how to disable it here or here. In the process of looking into all of this, I attempted to find where in the Firefox code the special dom.private-attribution.submission.enabled preference was actually used, but initially failed. Later, with guidance from @mcc's information, I managed to find the code and learned something about how Firefox handles certain 'about:config' preferences through a system called 'StaticPrefs'.

The Firefox source code defines a big collection of statically known about:config preferences, with commentary, default values, and their types, in modules/libpref/init/StaticPrefList.yaml (reading the comments for preferences you're interested in is often quite useful). Our dom.private-attribution.submission.enabled preference is defined there. However, you will search the Firefox source tree in vain for any direct reference to accessing these preferences from C++ code, because their access functions are actually created as part of the build process, and even in the build tree they're accessed through #defines that are in StaticPrefListBegin.h. In the normal C++ code, all that you'll see is calls to 'StaticPrefs:<underscore_name>()', with the name of the function being the name of the preference with .'s converted to '_' (underscores), giving names like dom_private_attribution_submission_enabled. You can see this in dom/privateattribution/PrivateAttribution.cpp in functions like 'PrivateAttribution::SaveImpression()' (for as long as this source code lives in Firefox before Mozilla rips it out, which I hope is immediately).

(In the Firefox build tree, the generated file to look at is modules/libpref/init/StaticPrefList_dom.h.)

Some preferences in StaticPrefList.yaml aren't accessed this way by C++ code (currently, those with 'mirror: never' that are used in C++ code), so their name will appear in .cpp files in the Firefox source if you search for them. I believe that Firefox C++ code can also use additional preferences not listed in StaticPrefList, but it will obviously have to access those preferences using their preferences name. There are various C++ interfaces for working with such preferences, so you'll see things like a preference's value looked up by its name, or its name passed to methods like nsIPrincipal's 'IsURIInPrefList()'.

A significant amount of Firefox is implemented in JavaScript. As far as I know, that JavaScript doesn't use StaticPrefs or any equivalent of it and always accesses preferences by their normal about:config name.

⌥ On Robots and Text

By: Nick Heer
20 June 2024 at 17:25

After Robb Knight found — and Wired confirmed — Perplexity summarizes websites which have followed its opt out instructions, I noticed a number of people making a similar claim: this is nothing but a big misunderstanding of the function of controls like robots.txt. A Hacker News comment thread contains several versions of these two arguments:

  • robots.txt is only supposed to affect automated crawling of a website, not explicit retrieval of an individual page.

  • It is fair to use a user agent string which does not disclose automated access because this request was not automated per se, as the user explicitly requested a particular page.

That is, publishers should expect the controls provided by Perplexity to apply only to its indexing bot, not a user-initiated page request. Wary as I am of being the kind of person who replies to pseudonymous comments on Hacker News, I think this is an unnecessarily absolutist reading of how site owners expect the Robots Exclusion Protocol to work.

To be fair, that protocol was published in 1994, well before anyone had to worry about websites being used as fodder for large language model training. And, to be fairer still, it has never been formalized. A spec was only recently proposed in September 2022. It has so far been entirely voluntary, but the draft standard proposes a more rigid expectation that rules will be followed. Yet it does not differentiate between different types of crawlers — those for search, others for archival purposes, and ones which power the surveillance economy — and contains no mention of A.I. bots. Any non-human means of access is expected to comply.

The question seems to be whether what Perplexity is doing ought to be considered crawling. It is, after all, responding to a direct retrieval request from a user. This is subtly different from how a user might search Google for a URL, in which case they are asking whether that site is in the search engine’s existing index. Perplexity is ostensibly following real-time commands: go fetch this webpage and tell me about it.

But it clearly is also crawling in a more traditional sense. The New York Times and Wired both disallow PerplexityBot, yet I was able to ask it to summarize a set of recent stories from both publications. At the time of writing, the Wired summary is about seventeen hours outdated, and the Times summary is about two days old. Neither publication has changed its robots.txt directives recently; they were both blocking Perplexity last week, and they are blocking it today. Perplexity is not fetching these sites in real-time as a human or web browser would. It appears to be scraping sites which have explicitly said that is something they do not want.

Perplexity should be following those rules and it is shameful it is not. But what if you ask for a real-time summary of a particular page, as Knight did? Is that something which should be identifiable by a publisher as a request from Perplexity, or from the user?

The Robots Exclusion Protocol may be voluntary, but a more robust method is to block bots by detecting their user agent string. Instead of expecting visitors to abide by your “No Homers Club” sign, you are checking IDs. But these strings are unreliable and there are often good reasons for evading user agent sniffing.

Perplexity says its bot is identifiable by both its user agent and the IP addresses from which it operates. Remember: this whole controversy is that it sometimes discloses neither, making it impossible to differentiate Perplexity-originating traffic from a real human being — and there is a difference.

A webpage being rendered through a web browser is subject to the quirks and oddities of that particular environment — ad blockers, Reader mode, screen readers, user style sheets, and the like — but there is a standard. A webpage being rendered through Perplexity is actually being reinterpreted and modified. The original text of the page is transformed through automated means about which neither the reader nor the publisher has any understanding.

This is true even if you ask it for a direct quote. I asked for a full paragraph of a recent article and it mashed together two separate sections. They are direct quotes, to be sure, but the article must have been interpreted to generate this excerpt.1

It is simply not the case that requesting a webpage through Perplexity is akin to accessing the page via a web browser. It is more like automated traffic — even if it is being guided by a real person.

The existing mechanisms for restricting the use of bots on our websites are imperfect and limited. Yet they are the only tools we have right now to opt out of participating in A.I. services if that is something one wishes to do, short of putting pages or an entire site behind a user name and password. It is completely reasonable for someone to assume their signal of objection to any robotic traffic ought to be respected by legitimate businesses. The absolute least Perplexity can do is respecting those objections by clearly and consistently identifying itself, and excluding websites which have indicated they do not want to be accessed by these means.


  1. I am not presently blocking Perplexity, and my argument is not related to its ability to access the article. I am only illustrating how it reinterprets text.

It seems routine to see a bunch of browser User-Agents from the same IP

By: cks
20 June 2024 at 02:54

The sensible thing to do about the plague of nasty web spiders and other abusers is to ignore it unless it's actually affecting your site. There is always some new SEO optimizer or marketer or (these days) LLM dataset collector crawling your site, and trying to block all of them is a Sisyphean labour. However, I am not necessarily a sensible person, and sometimes I have potentially clever ideas. One of my recent clever ideas was to look for IP addresses that requested content here using several different HTTP 'User-Agent' values, because this is something I see some of the bad web spiders do. Unfortunately it turns out that this idea does not work out, at least for me and based on my traffic.
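
(The mechanics of the check itself are simple enough. A sketch of the sort of log crunching involved, assuming an Apache combined log format with the User-Agent as the last quoted field and an arbitrary threshold, might be:)

  # Sketch: count distinct User-Agent values per client IP in an
  # Apache-style combined log read from standard input.
  import collections
  import re
  import sys

  agents_by_ip = collections.defaultdict(set)
  logline = re.compile(r'^(\S+) .* "([^"]*)"$')

  for line in sys.stdin:
      m = logline.match(line.rstrip('\n'))
      if m:
          agents_by_ip[m.group(1)].add(m.group(2))

  for ip, agents in sorted(agents_by_ip.items(), key=lambda kv: -len(kv[1])):
      if len(agents) > 3:
          print(ip, len(agents))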

Some of the IP addresses that are using multiple User-Agent are clearly up to no good; for example, right now there appears to be an AWS-based stealth spider crawling Wandering Thoughts with hundreds of different User-Agents from its most prolific AWS IPs. Some of its User-Agent values are sufficiently odd that I suspect it may be randomly assembling some of them from parts (eg, pick a random platform, pick a random 'AppleWebKit/' value, pick a random 'Chrome/' value, pick a random 'Safari/' value, and put them all together regardless of whether any real browser ever used the combination). This crawler also sometimes requests robots.txt, for extra something. Would it go away if you got the right name for it? Would it go away if you blocked all robots? I am not going to bet on either.

Some of the sources of multiple User-Agent values are legitimate robots that are either presenting variants of their normal User-Agent, all of which clearly identify them, or are multiple different robots operated by the same organization (Google has several different robots, it turns out). A few sources appear to have variant versions of their normal User-Agent string; one has its regular (and clear) robot identifier and then a version that puts a ' Bot' on the end. These are small scale sources and probably small scale software.

Syndication feed fetchers (ie, RSS or Atom feed fetchers) are another interesting category. There are a number of feed aggregators pulling various of my syndication feeds for different people, which they put some sort of identifier for in their User-Agent (or a count of subscribers), along with their general identification. At the small scale, some people seem to be using more than one feed reader (or feed involved thing) on their machines, with each program fetching things independently and using its own User-Agent. Some of this could also be several different people behind the same IP, all pulling my feed with different programs.

This is in fact a general thing. If you have multiple different devices at home, all of them behind a single IPv4 address, and you visit Wandering Thoughts from more than one, I will see more than one User-Agent from the same IP. The same obviously happens with larger NAT environments.

An interesting and relatively new category is the Fediverse. When a Fediverse message is posted that includes a URL, many Fediverse servers will fetch the URL in order to generate a link preview for their users. To my surprise, a surprising number of these fetches seem to be routed through common front-end IPs. Each Fediverse server is using a User-Agent that identifies it specifically (as well as its Fediverse software), so I see multiple User-Agents from this front-end. Today, the most active front end IP seems to have been used by 39 different Mastodon servers. Meanwhile, some of the larger Fediverse servers use multiple IPs for this link preview generation.

The upshot of all of this is that looking at IPs that use a lot of different User-Agents is too noisy to be useful for me to identify new spider IPs. Something that shows up with a lot of different User-Agents might be yet another bot IP, or it might be legitimate, and it's too much work to try to tell them apart. Also, at least right now there are a lot of such bot IPs (due to the AWS-hosted crawler).

Oh well, not all clever ideas work out (and sometimes I feel like writing up even negative results, even if they were sort of predictable in hindsight).

Mixed content upgrades on the web in mid 2024

By: cks
15 June 2024 at 02:28

To simplify, mixed content happens when a web page is loaded over HTTPS but it uses 'http:' URLs to access resources like images, CSS, Javascript, and other things included on the page. Mixed content is a particular historical concern of ours for moving our main web server to HTTPS, because of pages maintained by people here that were originally written for a non-HTTPS world and which use those 'http:' URLs. Mixed content came to my mind recently because of Mozilla's announcement that Firefox will upgrade more Mixed Content in Version 127, which tells you, in the small print, that Firefox 127 normally now either upgrades mixed content to HTTPS or blocks it; there are no more mixed content warnings. This is potentially troublesome for us when you couple it with Firefox moving towards automatically trying the main URL using HTTPS.

However, it turns out that this change to mixed content behavior is probably less scary to us than I thought, because it looks like Firefox is late to this party. Per this Chromium blog entry and the related Chromium feature status page, Chrome has been doing this since Chrome 86, which it appears was released back in late 2020. As for Safari, it appears that Safari just unconditionally blocks mixed content without trying to upgrade it under default circumstances (based on some casual Internet searching).

(The non-default circumstance is if the web server explicitly says to upgrade things with an 'upgrade-insecure-requests' Content-Security-Policy, which has been supported on all browsers for a long time. However, this only applies to the website's own URLs; if the web page fetches things from other URLs as 'http:', I'm not sure if this will upgrade them.)
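
As a small illustration, here's a rough Python sketch of checking whether a site sends this directive at all; the URL is a placeholder, and this only looks at the top-level page you fetch, not at anything it includes.

    import urllib.request

    def asks_for_upgrade(url):
        # Fetch the page and check its Content-Security-Policy response
        # header (if any) for the upgrade-insecure-requests directive.
        with urllib.request.urlopen(url) as resp:
            csp = resp.headers.get('Content-Security-Policy', '')
        return 'upgrade-insecure-requests' in csp

    print(asks_for_upgrade('https://example.org/'))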

So people accessing our sites over HTTPS have probably mostly been subjected to mixed content upgrades and blocks for years. Only the Firefox people have been seeing mixed content (with mixed content warnings), and now they're probably getting a better experience.

What we really have to look out for is when browsers will start trying HTTP to HTTPS upgrades for URLs that are explicitly entered as 'http:' URLs. For people hitting our websites, such 'http:' URLs could come from bookmarks, links on other (old) websites, or URLs in published papers (or just papers that circulate online) or other correspondence.

(As long as browsers preserve a fallback to HTTP, this won't strictly be the death knell of HTTP on the web. It will probably be a death knell of things like old HTTP only image URLs that assorted people on the net keep using as image sources, but the people with those URLs may consider this a feature.)

Perplexity A.I. Is Lying About Its User Agent

By: Nick Heer
15 June 2024 at 15:49

Robb Knight blocked various web scrapers via robots.txt and through nginx. Yet Perplexity seemed to be able to access his site:

I got a perfect summary of the post including various details that they couldn’t have just guessed. Read the full response here. So what the fuck are they doing?

[…]

Before I got a chance to check my logs to see their user agent, Lewis had already done it. He got the following user agent string which certainly doesn’t include PerplexityBot like it should: […]

I am sure Perplexity will respond to this by claiming it was inadvertent, and it has fixed the problem, and it respects publishers’ choices to opt out of web scraping. What matters is how we have only a small amount of control over how our information is used on the web. It defaults to open and public β€” which is part of the web’s brilliance, until the audience is no longer human.

Unless we want to lock everything behind a login screen, the only mechanisms for control that we have are dependent on companies like Perplexity being honest about their bots. There is no chance this problem only affects the scraping of a handful of independent publishers; this is certainly widespread. Without penalty or legal reform, A.I. companies have little incentive not to do exactly the same as Perplexity.

βŒ₯ Permalink

Web applications should support being used behind a reverse proxy

By: cks
7 June 2024 at 02:40

I recently wrote about the power of using external authentication in a web application. The short version is that this lets you support any authentication system that someone can put together with a front end web server, with little to no work on your part (it also means that the security of that authentication code is not your problem). However, supporting this in your web application does have one important requirement, which is that you have to support being run behind a front end web server, which normally means having the front end server acting as a reverse proxy.

Proper support for being run behind a reverse proxy requires a certain amount of additional work and features; for example, you need to support a distinction between internal URLs and external URLs (and sometimes things can get confusing). I understand that it might be tempting to skip doing this work, but when web applications do that and insist on being run directly as a stand alone web server, they wind up with a number of issues. For one obvious case, when you run directly, all of the authentication support has to be implemented by you, along with all of the authorization features that people will keep asking you for. Another case is that people will want you to do HTTPS, but you won't easily and automatically integrate with Let's Encrypt or other ACME based TLS certificate issuing and renewal systems.

(Let's set aside the issue of how good your TLS support will be as compared to a dedicated web server that has an active security team that worries about TLS issues and best practices. In general, sitting behind a reverse proxy removes the need to worry about a lot of HTTP and HTTPS issues, because you can count on a competent front end web server to deal with them for you.)

It used to be the case that a lot of web applications didn't support being run behind a reverse proxy (although a certain amount of that was PHP based applications that wanted to be run directly in the context of your main web server). My impression is that it's more common to support it these days, partly because various programming environments and frameworks make it easier to directly expose things over HTTP instead of anything else (HTTP has become the universal default protocol). However, even relatively recently I've seen applications where their support for reverse proxies was partial; you could run them behind one but not everything would necessarily work, or it could require additional things like HTML rewriting (although Prometheus Blackbox has added proper support for being behind a reverse proxy since I wrote that entry in 2018).
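
The HTTP-level basics of behaving well behind a reverse proxy aren't much code. Here's a minimal sketch of a Python WSGI middleware that honors two common front end headers; the header names are conventions rather than a standard (X-Forwarded-Prefix in particular varies by setup), and you should only trust them on requests that really came from your own front end.

    def behind_reverse_proxy(app):
        # Wrap a WSGI application so the URLs it generates reflect the
        # external scheme and URL prefix reported by the front end.
        def wrapped(environ, start_response):
            proto = environ.get('HTTP_X_FORWARDED_PROTO')
            if proto:
                environ['wsgi.url_scheme'] = proto
            prefix = environ.get('HTTP_X_FORWARDED_PREFIX')
            if prefix:
                environ['SCRIPT_NAME'] = prefix + environ.get('SCRIPT_NAME', '')
            return app(environ, start_response)
        return wrapped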

(I'd go so far as to suggest that most web applications that speak HTTP as their communication protocol should be designed to be used behind a reverse proxy when in 'production'. Direct HTTP should be considered to be for development setups, or maybe purely internal and trusted production usage. But this is partly my system administrator bias showing.)

The Deskilling of Web Development

By: Nick Heer
29 May 2024 at 17:55

Baldur Bjarnason:

But instead we’re all-in on deskilling the industry. Not content with removing CSS and HTML almost entirely from the job market, we’re now shifting towards the model where devs are instead “AI” wranglers. The web dev of the future will be an underpaid generalist who pokes at chatbot output until it runs without error, pokes at a copilot until it generates tests that pass with some coverage, and ships code that nobody understands and can’t be fixed if something goes wrong.

There are parallels in the history of software development to the various abstractions accumulated in a modern web development stack. Heck, you can find people throughout history bemoaning how younger generations lack some fundamental knowledge since replaced by automation or new technologies. It is always worth a gut-check about whether newer ideas are actually better. In the case of web development, what are we gaining and losing by eventually outsourcing much of it to generative software?

I think Bjarnason is mostly right: if web development becomes accessible to most people through layers of A.I. and third-party frameworks, it is abstracted to such a significant extent that it becomes meaningless gibberish. In fairness, the way plain HTML, CSS, and JavaScript work is — to many — meaningless gibberish. It really is better for many people that creating things for the web has become something which does not require a specialized skillset beyond entering a credit card number. But that is distinct from web development. When someone has code-level responsibility, they have an obligation to understand how things work.

βŒ₯ Permalink

The power of using external authentication information in a web application

By: cks
21 May 2024 at 03:37

Recently, a colleague at work asked me if we were using the university's central authentication system to authenticate access to our Grafana server. I couldn't give them a direct answer because we use Apache HTTP Basic Authentication with a local password file, but I could give them a pointer. Grafana has the useful property that it can be configured to take authentication information from a reverse proxy through a HTTP header field, and you can set up Apache with Shibboleth authentication so that it uses the institutional authentication system (with appropriate work by several parties).

(Grafana also supports generic OAuth2 authentication, but I don't know if the university provides an OAuth2 (or OIDC) interface to our central authentication system.)

This approach of outsourcing all authentication decisions to some front-end system is quite useful to enable people to solve their specific authentication challenges for your software. The front-end doesn't particularly have to be Apache; as Grafana shows, you can have the front-end stuff some information in an additional HTTP header and rely on that. Apache is useful because people have written all sorts of authentication modules for it, but there are probably other web servers with similar capabilities.
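
As a sketch of what the web application side of this can look like, here's a minimal Python WSGI middleware that trusts an identity header from the front end. 'X-Remote-User' is a header name I'm assuming for illustration; whatever name you use, the front end has to strip it from incoming client requests so that only the front end can ever set it.

    def front_end_authenticated(app, envvar='HTTP_X_REMOTE_USER'):
        # Trust an identity header set by the front end web server and
        # refuse requests that arrive without it.
        def wrapped(environ, start_response):
            user = environ.get(envvar)
            if not user:
                start_response('401 Unauthorized',
                               [('Content-Type', 'text/plain')])
                return [b'authentication required\n']
            environ['REMOTE_USER'] = user
            return app(environ, start_response)
        return wrapped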

You might think that your web application relying on OAuth2 would get you similar capabilities, but I believe there are two potential limitations. First, as the Grafana documentation on generic OAuth2 authentication demonstrates, this can make access control complicated. Either the system administrator has to stand up an OAuth2 provider that only authenticates some people and perhaps modifies the OAuth2 information returned from an upstream source, or your software needs to support enough features that you can exclude (or limit) certain people. Second, it doesn't necessarily enable a "single sign on" experience by default, because the OAuth2 provider may require people to approve passing information to your web application before it gives you the necessary information.

(Now that I look, I see that I can set my discount OIDC IdP to not require this approval, if I want to. In a production OIDC environment where the only applications are already trusted internal ones, we should set things this way, which I believe would give us a transparent SSO environment for purely OIDC/OAuth2 stuff.)

The drawback of entirely outsourcing authentication to a front-end system is that your web application doesn't provide busy system administrators with a standalone 'all in one' solution where they can plug in some configuration settings and be done; instead, they need a front-end as well and to configure authentication in it. I suspect that this is why Grafana supports other authentication methods (including OAuth2) along with delegating authentication to a front-end system. On the other hand, a web application that's purely for your own internal use could rely entirely on the front-end without worrying about authentication at all (this is effectively what we do with our use of Apache's HTTP Basic Authentication support).

My view is that your (general purpose) web application will never be able to support all of the unusual authentication systems that people will want; it's just too much code to write and maintain. Allowing a front-end to handle authentication (and with it possibly authorization, or at least primary authorization) makes that not your problem. You can implement a few common and highly requested stand alone authentication and authorization mechanisms (if you want to) and then push everything else away with 'do it in a front-end'.

PS: I'm not sure if what Grafana supports is actually OAuth2 or if it goes all the way to require OIDC, which I believe is effectively a superset of OAuth2; some of its documentation on this certainly seems to actually be working with OIDC providers. When past me wrote about mapping out my understanding of web based SSO systems, I neglected to work out and write down the practical differences between OAuth2 and OIDC.

One of OCSP's problems is the dominance of Chrome

By: cks
10 May 2024 at 03:23

To simplify greatly, OCSP is a set of ways to check whether or not a (public) TLS certificate has been revoked. It's most commonly considered in the context of web sites and things that talk to them. Today I had yet another problem because something was trying to check the OCSP status of a website and it didn't work. I'm sure there's a variety of contributing factors to this, but it struck me that one of them is that Chrome, the dominant browser, doesn't do OCSP checks.

If you break the dominant browser, people notice and fix it; indeed, people prioritize testing against the dominant browser and making sure that things are going to work before they put them in production. But if something is not supported in the dominant browser, it's much less noticeable if it breaks. And if something breaks in a way that doesn't affect even the less well used browsers (like Firefox), the odds of it being noticed are even lower. Something in the broad network environment broke OCSP for wget, but perhaps not for browsers? Good luck having that noticed, much less fixed.

Of course this leads to a spiral. When people run into OCSP problems on less common platforms, they can either try to diagnose and fix the problem (if fixing it is even within their power), or they can bypass or disable OCSP. Often they'll choose the latter (as I did), at which point they increase the number of non-OCSP people in the world and so further reduce the chances of OCSP problems being noticed and fixed. For instance, I couldn't cross-check the OCSP situation with Firefox, because I'd long ago disabled OCSP in Firefox after it caused me problems there.

I don't have any particular solutions, and since I consider OCSP to basically be a failure in practice I'm not too troubled by the problem, at least for OCSP.

PS: In this specific situation, OCSP was vanishingly unlikely to actually be telling me that there was a real security problem. If Github had to revoke any of its web certificates due to them being compromised, I'm sure I would have heard about it because it would be very big news.

On the duration of self-signed TLS (website) certificates

By: cks
19 April 2024 at 03:15

We recently got some hardware that has a networked management interface, which in today's world means it has a web server and further, this web server does HTTPS. Naturally, it has a self-signed TLS certificate (one it apparently generated on startup). For reasons beyond the scope of this entry, we decided that we wanted to monitor this web server interface to make sure it was answering. This got me curious about the duration of its self-signed TLS certificate, which turns out to be one year. I find myself not sure how I feel about this.

On the one hand, it is a little bit inconvenient for us that the expiry time isn't much longer. Our standard monitoring collects the TLS certificate expiry times of TLS certificates we encounter and we generate alerts for impending TLS certificate expiry, so if we don't do something special for this hardware, in a year or so we'll be robotically alerting that these self-signed TLS certificates are about to 'expire'.
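
(Harvesting the expiry time itself is straightforward. Here's a sketch of one way to do it in Python, assuming the third-party 'cryptography' package is available; the host name is a placeholder, and the fetch deliberately skips certificate verification, which is what you want when dealing with self-signed certificates.)

    import ssl
    from cryptography import x509

    def cert_not_after(host, port=443):
        # Fetch the server's certificate without verifying it and return
        # its notAfter time as a (UTC) datetime.
        pem = ssl.get_server_certificate((host, port))
        cert = x509.load_pem_x509_certificate(pem.encode())
        return cert.not_valid_after

    print(cert_not_after('bmc.example.com'))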

On the other hand, browsers don't actually care about the nominal expiry date of self-signed certificates; either your browser trusts them (because you told it to) or it doesn't, and the TLS certificate 'expiring' won't change this (or at most will make your browser ask you again if you want to trust the TLS certificate). We have server IPMIs with self-signed HTTPS TLS certificates that expired in 2020, and I've never noticed when I talked to them. Also, it's possible that (some) modern browsers will be upset with long-duration self-signed TLS certificates in the same way that they limit the duration of regular website TLS certificates. I haven't actually generated a long-duration self-signed TLS certificate to test.

(It's possible that we'll want to talk to a HTTP API on these management interfaces with non-browser tools. However, since expired TLS certificates are probably very common on this sort of management interface, I suspect that the relevant tools also don't care that a self-signed TLS certificate is expired.)

I'm probably not going to do anything to the actual devices, although I suspect I could prepare and upload a long duration self-signed certificate if I wanted to. I will hopefully remember to fix our alerts to exclude these TLS certificates before this time next year.

PS: The other problem with long duration self-signed TLS certificates is that even if browsers accept them today, maybe they won't be so happy with them in a year or three. The landscape of what browsers will accept is steadily changing, although perhaps someday it will reach a steady state.

A corner case in Firefox's user interface for addon updates

By: cks
14 April 2024 at 02:56

One of the things that make browsers interesting programs, in one sense, is that they generally have a lot of options, which leads to a lot of potentially different behavior, which creates plenty of places for bugs to hide out. One of my favorites is a long-standing user interface bug in Firefox's general UI for addon updates, one that's hard for ordinary people to come across because it requires a series of unusual settings and circumstances.

If you have automatic updates for your addons turned off, Firefox's 'about:addons' interface for managing your extensions will still periodically check for updates to your addons, and if there are any, it will add an extra tab in the UI that lists addons with pending updates. This tab has a count of how many pending updates there are, because why not. The bug is that if the same addon comes out with more than one update (that Firefox notices) before you update it, this count of pending updates (and the tab itself) will stick at one or more, even though there are no actual addon updates that haven't been applied.

(You can argue that there are actually two bugs here, and that Firefox should be telling you the number of addons with pending updates, not the number of pending updates to addons. The count is clearly of pending updates, because if Firefox has seen two updates for one addon, it will report a count of '2' initially.)

To reach this bug you need a whole series of unusual circumstances. You need to turn off automatic addon updates, you have to be using an addon that updates frequently, and then you need to leave your Firefox running for long enough (and not manually apply addon updates), because Firefox re-counts pending updates when it's restarted. In my case, I see this because I'm using the beta releases of uBlock Origin, which update relatively frequently, and even then I usually see it only on my office Firefox, which I often leave running but untouched for four days at a time.

(It may be possible to see this with a long-running Firefox even if you have addon updates automatically applied, because addon updates sometimes ask you to opt in to applying them right now instead of when Firefox restarts. I believe an addon asking this question may stop further updates to the addon from being applied, leading to the same 'multiple pending updates' counting issue.)

Browsers are complex programs with a large set of UI interactions (especially when it comes to handling modern interactive web pages and their Javascript). In a way it's a bit of a miracle that they work as well as they do, and I'm sure that there's other issues like this in the less frequented parts of all browsers.

Some notes on Firefox's media autoplay settings in practice as of Firefox 124

By: cks
30 March 2024 at 02:43

I've been buying digital music from one of the reasonably good online sources of it (the one that recently got acquired, again, making people nervous about its longer term future). In addition to the DRM-free lossless downloads of your purchases, this place lets you stream albums through your web browser, which in my case is an instance of Firefox. Recently, I noticed that my Firefox instance at work would seamlessly transition from one track to the next track of an album I was streaming, regardless of which label's sub-site I was on, while my home Firefox would not; when one track ended, the home Firefox would normally pause rather than start playing the next track.

(I listen to both albums I've purchased and albums I haven't and I'm just checking out. The former play through the main page for my collection, while the latter are scattered around various URLs, because each label or artist gets a <label>.<mumble>.com sub-domain for its releases, and then each release has its own page. For the obvious reasons, I long ago set my home Firefox to allow my collection's main page to autoplay music so it could seamlessly move from one track to the next.)

Both browser instances were set to disallow autoplay in general in the Preferences β†’ Privacy & Security (see Allow or block media autoplay in Firefox), and inspection of the per-site settings showed that my work Firefox actually had granted no autoplay permissions to sites while my home Firefox had a list of various subdomains for this particular vendor that were allowed to autoplay. After spelunking my about:config, I identified this as a difference in media.autoplay.blocking_policy, where the work Firefox had this set to the default of '0' while my home Firefox had a long-standing setting of '2'.

As discussed in Mozilla's wiki page on blocking media autoplay, the default setting for this preference allows autoplay once you've interacted with that tab, while my setting of '2' requires that you always click to (re)start audio or video playing (unless the site has been granted autoplay permissions). Historically I set this to '2' to try to stop Youtube from autoplaying a second video after my first one had finished. In practice this usage has been rendered functionally obsolete by Youtube's own 'disable autoplay' setting in its video player (although it still works to prevent autoplay if I've forgotten to turn that on in this Firefox session or if Youtube is in a playlist and ignores that setting).

(For both Youtube and this digital music source, a setting of '1', a transient user gesture activation, is functionally equivalent to '2' for me because it will normally be more than five seconds before the video or track finishes playing, which means that the permission will have expired by the time the site wants to advance to the next thing.)

Since I check out multi-track albums much more often than I look at Youtube videos (in this Firefox), and Youtube these days does have a reliable 'disable autoplay' setting, I opted to leave media.autoplay.blocking_policy set to '0' in the work Firefox instance I use for this stuff and I've just changed it to '0' in my home one as well. I could avoid this if I set up a custom profile for this music source, but I haven't quite gotten to that point yet.

(I do wish Firefox allowed, effectively, per-site settings of this as part of the per-site autoplay permissions, but I also understand why they don't; I'm sure the Preferences and per-site settings UI complexity would be something to see.)

(If I'd thought to check my previous notes on this I probably would have been led to media.autoplay.blocking_policy right away, but it didn't occur to me to check here, even though I knew I'd fiddled a lot with Firefox media autoplay over the years. My past self writing things down here doesn't guarantee that my future (present) self will remember that they exist.)

PS: I actually go back and forth on automatically moving on to the next track of an album I'm checking out, because the current 'stop after one track' behavior does avoid me absently listening to the whole thing. If I find myself unintentionally listening to too many albums that in theory I'm only checking out, I'll change the setting back.

What do we count as 'manual' management of TLS certificates

By: cks
13 March 2024 at 02:29

Recently I casually wrote about how even big websites may still be manually managing TLS certificates. Given that we're talking about big websites, this raises a somewhat interesting question of what we mean by 'manual' and 'automatic' TLS certificate management.

A modern big website probably has a bunch of front end load balancers or web servers that terminate TLS, and regardless of what else is involved in their TLS certificate management it's very unlikely that system administrators are logging in to each one of them to roll over its TLS certificate to a new one (any more than they manually log in to those servers to deploy other changes). At the same time, if the only bit of automation involved in TLS certificate management is deploying a TLS certificate across the fleet (once you have it), I think most people would be comfortable still calling that (more or less) 'manual' TLS certificate management.

As a system administrator who used to deal with TLS certificates (back then I called them SSL certificates) the fully manual way, I see three broad parts to fully automated management of TLS certificates:

  • automated deployment, where once you have the new TLS certificate you don't have to copy files around on a particular server, restart the web server, and so on. Put the TLS certificate in the right place and maybe push a button and you're done.

  • automated issuance of TLS certificates, where you don't have to generate keys, prepare a CSR, go to a web site, perhaps put in your credit card information or some other 'cost you money' stuff, perhaps wait for some manual verification or challenge by email, and finally download your signed certificate. Instead you run a program and you have a new TLS certificate.

  • automated renewal of TLS certificates, where you don't have to remember to do anything by hand when your TLS certificates are getting close enough to their expiry time. (A lesser form of automated renewal is automated reminders that you need to manually renew.)

As a casual thing, if you don't have fully automated management of TLS certificates I would say you had 'manual management' of them, because a human had to do something to make the whole process go. If I was trying to be precise and you had automated deployment but not the other two, I might describe you as having 'mostly manual management' of your TLS certificates. If you had automated issuance (and deployment) but no automated renewals, I might say you had 'partially automated' or 'partially manual' TLS certificate management.

(You can have automated issuance but not automated deployment or automated renewal and at that point I'd probably still say you had 'manual' management, because people still have to be significantly involved even if you don't have to wrestle with a TLS Certificate Authority's website and processes.)

I believe that at least some TLS Certificate Authorities support automated issuance of year-long certificates, but I'm not sure. Now that I've looked, I'm going to have to stop assuming that a website using a year-long TLS certificate is a reliable sign that they're not using automated issuance.

Even big websites may still be manually managing TLS certificates (or close)

By: cks
19 February 2024 at 03:06

I've written before about how people's soon-to-expire TLS certificates aren't necessarily a problem, because not everyone manages their TLS certificates the way Let's Encrypt users do, with automated renewal 30 days in advance and perhaps short-lived TLS certificates. For example, some places (like Facebook) have automation but seem to only deploy TLS certificates that are quite close to expiry. Other places at least look as if they're still doing things by hand, and recently I got to watch an example of that.

As I mentioned yesterday, the department outsources its public website to a SaaS CMS provider. While the website has a name here for obvious reasons, it uses various assets that are hosted on sites under the SaaS provider's domain names (both assets that are probably general and assets, like images, that are definitely specific to us). For reasons beyond the scope of this entry, we monitor the reachability of these additional domain names with our metrics system. This only checks on-campus reachability, of course, but that's still important even if most visitors to the site are probably from outside the university.

As a side effect of this reachability monitoring, we harvest the TLS certificate expiry times of these domains, and because we haven't done anything special about it, they get shown on our core status dashboard alongside the expiry times of TLS certificates that we're actually responsible for. The result of this was that recently I got to watch their TLS expiry times count down to only two weeks away, which is lots of time from one view while also alarmingly little if you're used to renewals 30 days in advance. Then they flipped over to a new year-long TLS certificate and our dashboard was quiet again (except for the next such external site that has dropped under 30 days).

Interestingly, the current TLS certificate was issued about a week before it was deployed, or at least its Not-Before date is February 9th at 00:00 UTC and it seems to have been put into use this past Friday, the 16th. One reason for this delay in deployment is suggested by our monitoring, which seems to have detected traces of a third certificate sometimes being visible, this one expiring June 23rd, 2024. Perhaps there were some deployment challenges across the SaaS provider's fleet of web servers.

(Their current TLS certificate is actually good for just a bit over a year, with a Not-Before of 2024-02-09 and a Not-After of 2025-02-28. This is presumably accepted by browsers, even though it's a bit over 365 days; I haven't paid attention to the latest restrictions from places like Apple.)

We outsource our public web presence and that's fine

By: cks
18 February 2024 at 02:39

I work for a pretty large Computer Science department, one where we have the expertise and need to do a bunch of internal development and in general we maintain plenty of things, including websites. Thus, it may surprise some people to learn that the department's public-focused web site is currently hosted externally on a SaaS provider. Even the previous generation of our outside-facing web presence was hosted and managed outside of the department. To some, this might seem like the wrong decision for a department of Computer Science (of all people) to make; surely we're capable of operating our own web presence and thus should as a matter of principle (and independence).

Well, yes and no. There are two realities. The first is that a modern content management system is both a complex thing (to develop, and to operate and maintain securely) and a commodity, with many organizations able to provide good ones at competitive prices. The second is that both the system administration and the publicity side of the department only have so many people and so much time. Or, to put it another way, all of us have work to get done.

The department has no particular 'competitive advantage' in running a CMS website; in fact, we're almost certain to be worse at it than someone doing it at scale commercially, much like what happened with webmail. If the department decided to operate its own CMS anyway, it would be as a matter of principle (which principles would depend on whether the CMS was free or paid for). So far, the department has not decided that this particular principle is worth paying for, both in direct costs and in the opportunity costs of what that money and staff time could otherwise be used for.

Personally I agree with that decision. As mentioned, CMSes are a widely available (but specialized) commodity. Were we to do it ourselves, we wouldn't be, say, making a gesture of principle against the centralization of CMSes. We would merely be another CMS operator in an already crowded pond that has many options.

(And people here do operate plenty of websites and web content on our own resources. It's just that the group here responsible for our public web presence found it most effective and efficient to use a SaaS provider for this particular job.)
