You may have heard that there are a lot of crawlers out there these
days, many of them apparently harvesting training data for LLMs. Recently I've
been getting more strict about access to this blog, so
for my own interest I'm going to show statistics on what HTTP status
codes all of the requests to here got over the past 21 hours
and a bit. I think this is about typical, although there may be more
blocked things than usual.
I'll start with the overall numbers for all requests:
22792 403 [45%]
9207 304 [18.3%]
9055 200 [17.9%]
8641 429 [17.1%]
518 301
58 400
33 404
2 206
1 302
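Counts like these can be produced by simply tallying the status code
field of the web server's access log. Here's a minimal Python sketch,
assuming an Apache 'combined' style log where the status code is the
ninth whitespace-separated field (adjust if your log format differs):

    import collections, sys

    # Tally HTTP status codes from an Apache 'combined' format access
    # log read on standard input; in that format the status code is
    # the ninth whitespace-separated field.
    counts = collections.Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:
            counts[fields[8]] += 1

    total = sum(counts.values()) or 1
    for code, n in counts.most_common():
        print(f"{n:6d} {code} [{n / total:.1%}]")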
HTTP 403 is the error code that people get on blocked access; I'm
not sure what's producing the HTTP 400s. The two HTTP 206s were
from LinkedIn's bot against a recent entry
and completely puzzle me. Some of the blocked accesses are from major
web crawlers requesting things that they shouldn't (Bing is a special
repeat offender here), but many of them are not. Between HTTP 403s
and HTTP 429s, 62% or so of the requests overall were rejected and
only 36% got a useful reply.
(With less thorough and active blocks, that would be a lot more
traffic for Wandering Thoughts to handle.)
The picture for syndication feeds is rather different, as you might
expect, but not quite as different as I'd like:
9136 304 [39.5%]
8641 429 [37.4%]
3614 403 [15.6%]
1663 200 [ 7.2%]
19 301
Some of those rejections are for major web crawlers and almost a
thousand are for a pair of prolific, high-volume repeat request
sources, but a lot of them aren't. Feed requests account for 23073
requests out of a total of 50307, or about 45% of the requests. To
me this feels quite low relative to anything plausibly originating
from humans; most of the time I expect feed requests to significantly
outnumber visits from actual people.
(In terms of my syndication feed rate limiting, there were 19440 'real' syndication
feed requests (84% of the total attempts), and out of them 44.4%
were rate-limited. That's actually a lower level of rate limiting
than I expected; possibly various feed fetchers have noticed it
and reduced their attempt frequency. 46.9% made successful
conditional GET requests (ones that got an HTTP 304 response) and
8.5% actually fetched feed data.)
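For those who haven't looked at this recently, a conditional GET is
one where the client sends back the ETag and Last-Modified values it
got from its previous fetch, so an unchanged feed can be answered with
a cheap HTTP 304. A minimal Python sketch of the client side (purely
illustrative, not modelled on any particular feed reader):

    import urllib.request, urllib.error

    def fetch_feed(url, etag=None, last_modified=None):
        # Send back the validators from the previous fetch, if any,
        # so the server can reply with 304 instead of the full feed.
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                # Full response: save the new validators for next time.
                return (resp.read(), resp.headers.get("ETag"),
                        resp.headers.get("Last-Modified"))
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # Not modified: keep the cached copy and old validators.
                return None, etag, last_modified
            raise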
DWiki, the wiki engine behind the blog, has a concept of
alternate 'views' of pages. Syndication
feeds are alternate views, but so are a bunch of other things.
Excluding syndication feeds, the picture for requests of alternate
views of pages is:
5499 403
510 200
39 301
3 304
The most blocked alternate views are:
1589 ?writecomment
1336 ?normal
1309 ?source
917 ?showcomments
(The most successfully requested view is '?showcomments', which isn't
really a surprise to me; I expect search engines to look through that,
for one.)
If I look only at plain requests, not requests for syndication feeds
or alternate views, I see:
13679 403 [64.5%]
6882 200 [32.4%]
460 301
68 304
58 400
33 404
2 206
1 302
This means the breakdown of traffic is 21183 normal requests (42%),
23073 feed requests (45%), and the remaining roughly 12% for alternate
views, almost all of which were rejected.
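Getting a breakdown like this means classifying each logged request,
which for DWiki style URLs mostly comes down to looking at the query
string. A small Python sketch of the idea; the assumption that feed
requests use an 'atom'-prefixed view name is mine, and the real view
names may differ:

    from urllib.parse import urlsplit

    # Assumed (not confirmed) view name prefix for syndication feeds.
    FEED_VIEWS = ("atom",)

    def classify(request_url):
        """Classify a logged request URL as 'plain', 'feed', or 'altview'."""
        parts = urlsplit(request_url)
        if not parts.query:
            return "plain"
        # The view is the first (usually only) query string key.
        view = parts.query.split("&")[0].split("=")[0]
        return "feed" if view.startswith(FEED_VIEWS) else "altview"

    # For example:
    #   classify("/blog/SomeEntry?writecomment") -> "altview"
    #   classify("/blog/?atom")                  -> "feed"
    #   classify("/blog/SomeEntry")              -> "plain"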
Out of the HTTP 403 rejections across all requests, the 'sources'
break down something like this:
7116 Forged Chrome/129.0.0.0 User-Agent
1451 Bingbot
1173 Forged Chrome/121.0.0.0 User-Agent
930 PerplexityBot ('AI' LLM data crawler)
915 Blocked sources using a 'Go-http-client/1.1' User-Agent
Those HTTP 403 rejections came from 12619 different IP addresses,
in contrast to the successful requests (HTTP 2xx and 3xx codes),
which came from 18783 different IP addresses. After looking into
the ASN
breakdown of those IPs, I've decided that I can't write anything
about them with confidence, and it's possible that part of what is
going on is that I have misfiring blocking rules (alternately, I'm
being hit from a big network of compromised machines being used as
proxies, perhaps the same network behind the Chrome/129.0.0.0
requests). However, some of the ASNs that rank highly are definitely
ones I recognize from other contexts, such as attempted comment spam.
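If you want to do this sort of ASN lookup yourself, one convenient
option is Team Cymru's DNS-based IP to ASN mapping service. A minimal,
IPv4-only Python sketch using the third-party dnspython package (2.x
API):

    import dns.resolver   # the third-party 'dnspython' package

    def asn_for_ipv4(ip):
        # Team Cymru's origin.asn.cymru.com zone maps a reversed IPv4
        # address to a TXT record that looks like
        # "ASN | prefix | CC | registry | allocation date".
        rev = ".".join(reversed(ip.split(".")))
        answer = dns.resolver.resolve(rev + ".origin.asn.cymru.com", "TXT")
        txt = answer[0].to_text().strip('"')
        return txt.split(" | ")[0]

    # For example, asn_for_ipv4("8.8.8.8") should return "15169",
    # Google's ASN.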
Update: Well that was a learning experience about actual browser
User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have
been so forged (although people really should be running more current
versions of Chrome). I apologize to the people using real current
Chrome versions who were temporarily unable to read the blog because
of my overly aggressive blocks.