Pixel Envy
Trapping Misbehaving Bots in an A.I. Labyrinth 22 March 2025 at 04:32

Trapping Misbehaving Bots in an A.I. Labyrinth

By: Nick Heer

22 March 2025 at 04:32

Reid Tatoris, Harsh Saxena, and Luis Miglietti, of Cloudflare:

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

Two thoughts:

This is amusing. Nothing funnier than using someone’s own words or, in this case, technology against them.
This is surely going to lead to the same arms race as exists now between privacy protections and hostile adtech firms. Right?

⌥ Permalink

Pixel Envy
⌥ On Robots and Text 20 June 2024 at 17:25

⌥ On Robots and Text

Pixel Envy

By: Nick Heer

20 June 2024 at 17:25

After Robb Knight found — and Wired confirmed — Perplexity summarizes websites which have followed its opt out instructions, I noticed a number of people making a similar claim: this is nothing but a big misunderstanding of the function of controls like robots.txt. A Hacker News comment thread contains several versions of these two arguments:

robots.txt is only supposed to affect automated crawling of a website, not explicit retrieval of an individual page.
It is fair to use a user agent string which does not disclose automated access because this request was not automated per se, as the user explicitly requested a particular page.

That is, publishers should expect the controls provided by Perplexity to apply only to its indexing bot, not a user-initiated page request. Wary of being the kind of person who replies to pseudonymous comments on Hacker News, this is an unnecessarily absolutist reading of how site owners expect the Robots Exclusion Protocol to work.

To be fair, that protocol was published in 1994, well before anyone had to worry about websites being used as fodder for large language model training. And, to be fairer still, it has never been formalized. A spec was only recently proposed in September 2022. It has so far been entirely voluntary, but the draft standard proposes a more rigid expectation that rules will be followed. Yet it does not differentiate between different types of crawlers — those for search, others for archival purposes, and ones which power the surveillance economy — and contains no mention of A.I. bots. Any non-human means of access is expected to comply.

The question seems to be whether what Perplexity is doing ought to be considered crawling. It is, after all, responding to a direct retrieval request from a user. This is subtly different from how a user might search Google for a URL, in which case they are asking whether that site is in the search engine’s existing index. Perplexity is ostensibly following real-time commands: go fetch this webpage and tell me about it.

But it clearly is also crawling in a more traditional sense. The New York Times and Wired both disallow PerplexityBot, yet I was able to ask it to summarize a set of recent stories from both publications. At the time of writing, the Wired summary is about seventeen hours outdated, and the Times summary is about two days old. Neither publication has changed its robots.txt directives recently; they were both blocking Perplexity last week, and they are blocking it today. Perplexity is not fetching these sites in real-time as a human or web browser would. It appears to be scraping sites which have explicitly said that is something they do not want.

Perplexity should be following those rules and it is shameful it is not. But what if you ask for a real-time summary of a particular page, as Knight did? Is that something which should be identifiable by a publisher as a request from Perplexity, or from the user?

The Robots Exclusion Protocol may be voluntary, but a more robust method is to block bots by detecting their user agent string. Instead of expecting visitors to abide by your “No Homers Club” sign, you are checking IDs. But these strings are unreliable and there are often good reasons for evading user agent sniffing.

Perplexity says its bot is identifiable by both its user agent and the IP addresses from which it operates. Remember: this whole controversy is that it sometimes discloses neither, making it impossible to differentiate Perplexity-originating traffic from a real human being — and there is a difference.

A webpage being rendered through a web browser is subject to the quirks and oddities of that particular environment — ad blockers, Reader mode, screen readers, user style sheets, and the like — but there is a standard. A webpage being rendered through Perplexity is actually being reinterpreted and modified. The original text of the page is transformed through automated means about which neither the reader or the publisher has any understanding.

This is true even if you ask it for a direct quote. I asked for a full paragraph of a recent article and it mashed together two separate sections. They are direct quotes, to be sure, but the article must have been interpreted to generate this excerpt.¹

It is simply not the case that requesting a webpage through Perplexity is akin to accessing the page via a web browser. It is more like automated traffic — even if it is being guided by a real person.

The existing mechanisms for restricting the use of bots on our websites are imperfect and limited. Yet they are the only tools we have right now to opt out of participating in A.I. services if that is something one wishes to do, short of putting pages or an entire site behind a user name and password. It is completely reasonable for someone to assume their signal of objection to any robotic traffic ought to be respected by legitimate businesses. The absolute least Perplexity can do is respecting those objections by clearly and consistently identifying itself, and excluding websites which have indicated they do not want to be accessed by these means.

I am not presently blocking Perplexity, and my argument is not related to its ability to access the article. I am only illustrating how it reinterprets text. ↥︎

Pixel Envy
Perplexity A.I. Is Lying About Its User Agent 15 June 2024 at 15:49

Perplexity A.I. Is Lying About Its User Agent

Pixel Envy

By: Nick Heer

15 June 2024 at 15:49

Robb Knight blocked various web scrapers via robots.txt and through nginx. Yet Perplexity seemed to be able to access his site:

I got a perfect summary of the post including various details that they couldn’t have just guessed. Read the full response here. So what the fuck are they doing?

[…]

Before I got a chance to check my logs to see their user agent, Lewis had already done it. He got the following user agent string which certainly doesn’t include PerplexityBot like it should: […]

I am sure Perplexity will respond to this by claiming it was inadvertent, and it has fixed the problem, and it respects publishers’ choices to opt out of web scraping. What matters is how we have only a small amount of control over how our information is used on the web. It defaults to open and public — which is part of the web’s brilliance, until the audience is no longer human.

Unless we want to lock everything behind a login screen, the only mechanisms for control that we have are dependent on companies like Perplexity being honest about their bots. There is no chance this problem only affects the scraping of a handful of independent publishers; this is certainly widespread. Without penalty or legal reform, A.I. companies have little incentive not to do exactly the same as Perplexity.

⌥ Permalink

Normal view