
⌥ On Robots and Text

By: Nick Heer
20 June 2024 at 17:25

After Robb Knight found — and Wired confirmed — Perplexity summarizes websites which have followed its opt out instructions, I noticed a number of people making a similar claim: this is nothing but a big misunderstanding of the function of controls like robots.txt. A Hacker News comment thread contains several versions of these two arguments:

  • robots.txt is only supposed to affect automated crawling of a website, not explicit retrieval of an individual page.

  • It is fair to use a user agent string which does not disclose automated access because this request was not automated per se, as the user explicitly requested a particular page.

That is, publishers should expect the controls provided by Perplexity to apply only to its indexing bot, not a user-initiated page request. At the risk of being the kind of person who replies to pseudonymous comments on Hacker News, I think this is an unnecessarily absolutist reading of how site owners expect the Robots Exclusion Protocol to work.

To be fair, that protocol was published in 1994, well before anyone had to worry about websites being used as fodder for large language model training. And, to be fairer still, it has never been formalized. A spec was only recently proposed in September 2022. It has so far been entirely voluntary, but the draft standard proposes a more rigid expectation that rules will be followed. Yet it does not differentiate between different types of crawlers — those for search, others for archival purposes, and ones which power the surveillance economy — and contains no mention of A.I. bots. Any non-human means of access is expected to comply.
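The voluntary nature of the protocol is visible in how it is consumed in practice: a compliant crawler parses robots.txt and checks its own user agent token before fetching, but nothing stops a non-compliant one from skipping the check entirely. Here is a minimal sketch using Python's standard library, with a hypothetical robots.txt that disallows PerplexityBot site-wide:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt disallowing PerplexityBot everywhere.
rules = """\
User-agent: PerplexityBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler asks before fetching. The protocol itself
# enforces nothing -- compliance is entirely up to the bot operator.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))
```

Note that the rules are matched against whatever token the bot claims for itself, which is exactly why a bot that withholds its token slips through.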

The question seems to be whether what Perplexity is doing ought to be considered crawling. It is, after all, responding to a direct retrieval request from a user. This is subtly different from how a user might search Google for a URL, in which case they are asking whether that site is in the search engine’s existing index. Perplexity is ostensibly following real-time commands: go fetch this webpage and tell me about it.

But it clearly is also crawling in a more traditional sense. The New York Times and Wired both disallow PerplexityBot, yet I was able to ask it to summarize a set of recent stories from both publications. At the time of writing, the Wired summary is about seventeen hours out of date, and the Times summary is about two days old. Neither publication has changed its robots.txt directives recently; they were both blocking Perplexity last week, and they are blocking it today. Perplexity is not fetching these sites in real-time as a human or web browser would. It appears to be scraping sites which have explicitly said that is something they do not want.

Perplexity should be following those rules and it is shameful it is not. But what if you ask for a real-time summary of a particular page, as Knight did? Is that something which should be identifiable by a publisher as a request from Perplexity, or from the user?

The Robots Exclusion Protocol may be voluntary, but a more robust method is to block bots by detecting their user agent string. Instead of expecting visitors to abide by your “No Homers Club” sign, you are checking IDs. But these strings are unreliable and there are often good reasons for evading user agent sniffing.
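User agent blocking amounts to substring matching against whatever string the client chooses to send. A minimal sketch of the idea, with a hypothetical blocklist (real deployments usually do this in the web server, such as nginx, before a request ever reaches the application):

```python
# Hypothetical blocklist of crawler tokens. The specific tokens here are
# illustrative; a site owner would maintain their own list.
BLOCKED_AGENTS = ("PerplexityBot", "GPTBot", "CCBot")

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header names a blocked crawler.

    This check is only as trustworthy as the string the client sends:
    a bot that omits or spoofs its token walks straight past it.
    """
    return any(token in user_agent for token in BLOCKED_AGENTS)

print(is_blocked("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))  # True
print(is_blocked("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)"))  # False
```

The check-IDs analogy holds: it works only on bots that present honest identification, which is precisely the failure at issue here.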

Perplexity says its bot is identifiable by both its user agent and the IP addresses from which it operates. Remember: this whole controversy is that it sometimes discloses neither, making it impossible to differentiate Perplexity-originating traffic from that of a real human being — and there is a difference.

A webpage being rendered through a web browser is subject to the quirks and oddities of that particular environment — ad blockers, Reader mode, screen readers, user style sheets, and the like — but there is a standard. A webpage being rendered through Perplexity is actually being reinterpreted and modified. The original text of the page is transformed through automated means about which neither the reader nor the publisher has any understanding.

This is true even if you ask it for a direct quote. I asked for a full paragraph of a recent article and it mashed together two separate sections. They are direct quotes, to be sure, but the article must have been interpreted to generate this excerpt.1

It is simply not the case that requesting a webpage through Perplexity is akin to accessing the page via a web browser. It is more like automated traffic — even if it is being guided by a real person.

The existing mechanisms for restricting the use of bots on our websites are imperfect and limited. Yet they are the only tools we have right now to opt out of participating in A.I. services if that is something one wishes to do, short of putting pages or an entire site behind a user name and password. It is completely reasonable for someone to assume their signal of objection to any robotic traffic ought to be respected by legitimate businesses. The absolute least Perplexity can do is to respect those objections by clearly and consistently identifying itself, and to exclude websites which have indicated they do not want to be accessed by these means.


  1. I am not presently blocking Perplexity, and my argument is not related to its ability to access the article. I am only illustrating how it reinterprets text. ↥︎

Adobe Codifies Pledge Not to Train A.I. on Customer Data

By: Nick Heer
18 June 2024 at 23:32

Ina Fried, Axios:

Adobe on Tuesday updated its terms of service to make explicit that it won’t train AI systems using customer data.

The move follows an uproar over largely unrelated changes Adobe made in recent days to its terms of service — which contained wording that some customers feared was granting Adobe broad rights to customer content.

Again, I must ask whether businesses are aware of how little trust there currently is in technology firms’ A.I. use. People misinterpret legal documents all the time — a minor consequence of how we have normalized signing a non-negotiable contract every time we create a new account. Most people are not equipped to read and comprehend the consequences of those contracts, and it is unsurprising that they assume the worst.

⌥ Permalink

Perplexity A.I. Is Lying About Its User Agent

By: Nick Heer
15 June 2024 at 15:49

Robb Knight blocked various web scrapers via robots.txt and through nginx. Yet Perplexity seemed to be able to access his site:

I got a perfect summary of the post including various details that they couldn’t have just guessed. Read the full response here. So what the fuck are they doing?

[…]

Before I got a chance to check my logs to see their user agent, Lewis had already done it. He got the following user agent string which certainly doesn’t include PerplexityBot like it should: […]

I am sure Perplexity will respond to this by claiming it was inadvertent, and it has fixed the problem, and it respects publishers’ choices to opt out of web scraping. What matters is how we have only a small amount of control over how our information is used on the web. It defaults to open and public — which is part of the web’s brilliance, until the audience is no longer human.

Unless we want to lock everything behind a login screen, the only mechanisms for control that we have are dependent on companies like Perplexity being honest about their bots. There is no chance this problem only affects the scraping of a handful of independent publishers; this is certainly widespread. Without penalty or legal reform, A.I. companies have little incentive not to do exactly the same as Perplexity.

⌥ Permalink

Amazon Executives May Be Personally Liable for Unintentional Prime Registrations

By: Nick Heer
30 May 2024 at 23:55

Ashley Belanger, Ars Technica:

But the judge apparently did not find Amazon’s denials completely persuasive. Viewing the FTC’s complaint “in the light most favorable to the FTC,” Judge John Chun concluded that “the allegations sufficiently indicate that Amazon had actual or constructive knowledge that its Prime sign-up and cancellation flows were misleading consumers.”

[…]

One such trick that Chun called out saw Amazon offering two-day free shipping with the click of a button at checkout that also signed customers up for Prime even if they didn’t complete the purchase.

“With the offer of Amazon Prime for the purpose of free shipping, reasonable consumers could assume that they would not proceed with signing up for Prime unless they also placed their order,” Chun said, ultimately rejecting Amazon’s claims that all of its “disclosures would be clear and conspicuous to any reasonable consumer.”

This is far from the only instance of scumbag design cited by Chun, and it is bizarre to me that anybody would defend choices like these.

⌥ Permalink

Scarlett Johansson Wants Answers About ChatGPT Voice That Sounds Like ‘Her’

By: Nick Heer
21 May 2024 at 14:02

Bobby Allyn, NPR:

Lawyers for Scarlett Johansson are demanding that OpenAI disclose how it developed an AI personal assistant voice that the actress says sounds uncannily similar to her own.

[…]

Johansson said that nine months ago [Sam] Altman approached her proposing that she allow her voice to be licensed for the new ChatGPT voice assistant. He thought it would be “comforting to people” who are uneasy with AI technology.

“After much consideration and for personal reasons, I declined the offer,” Johansson wrote.

In a defensive blog post, OpenAI said it believes “AI voices should not deliberately mimic a celebrity’s distinctive voice” and that any resemblance between Johansson and the “Sky” voice demoed earlier this month is basically a coincidence, a claim only slightly undercut by a single-word tweet posted by Altman.

OpenAI’s voice mimicry — if you want to be generous — and that iPad ad add up to a banner month for technology companies’ relationship to the arts.1 Are there people in power at these companies who can see how behaviours like these look? We are less than a year out from both the most recent Hollywood writers’ and actors’ strikes, both of which were fuelled in part by A.I. anxieties.

Update: According to the Washington Post, the sound-alike voice really does just sound alike.


  1. A more minor but arguably funnier faux pas occurred when Apple confirmed to the Wall Street Journal the authenticity of the statement it gave to Ad Age — both likely paywalled — but refused to send it to the Journal. ↥︎

⌥ Permalink

Slack’s Sneaky A.I. Training Policy

By: Nick Heer
17 May 2024 at 22:21

Corey Quinn:

I’m sorry Slack, you’re doing fucking WHAT with user DMs, messages, files, etc? I’m positive I’m not reading this correctly.

[Screenshot of the opt out portion of Slack’s “privacy principles”: Contact us to opt out. If you want to exclude your Customer Data from Slack global models, you can opt out. […] ]

Slack replied:

Hello from Slack! To clarify, Slack has platform-level machine-learning models for things like channel and emoji recommendations and search results. And yes, customers can exclude their data from helping train those (non-generative) ML models. Customer data belongs to the customer. We do not build or train these models in such a way that they could learn, memorize, or be able to reproduce some part of customer data. […]

One thing I like about this statement is how the fifth word is “clarify” and then it becomes confusing. Based on my reading of its “privacy principles”, I think Slack’s “global model” is so named because it is available to everyone and is a generalist machine learning model for small in-workspace suggestions, while its LLM is called “Slack AI” and it is a paid add-on. But I could be wrong, and that is confusing as hell.

Ivan Mehta and Ingrid Lunden, TechCrunch:

In its terms, Slack says that if customers opt out of data training, they would still benefit from the company’s “globally trained AI/ML models.” But again, in that case, it’s not clear then why the company is using customer data in the first place to power features like emoji recommendations.

The company also said it doesn’t use customer data to train Slack AI.

If you want to opt out, you cannot do so in a normal way, like through a checkbox. The workspace owner needs to send an email to a generic inbox with a specific subject line. Let me make it a little easier for you:

To: feedback@slack.com

Subject: Slack Global model opt-out request.

Body: Hey, your privacy principles are pretty confusing and feel sneaky. I am opting this workspace out of training your global model: [paste your workspace.slack.com address here]. This underhanded behaviour erodes my trust in your product. Have a pleasant day.

That ought to do the trick.

⌥ Permalink
