bearblog
YouTube Comment Spam and the Linking Non-Link

22 May 2022 at 21:30

As somebody who quite often scrolls down to the comments section on YouTube (for my sins), I’m regularly faced with the innovative tactics that spammers use to get their comments through Google’s automated filtering and detection systems.

One method that’s somehow persisted for a few years is the linking non-link. YouTube doesn’t like comments with links in them and will often silently hide or sometimes outright refuse them. But the link-detection algorithm used in spam filtering isn’t the same one that’s used in the frontend to convert textual links into clickable hyperlinks!

So spammers craft their comments around domains with unusual TLDs, mixed into normal-looking text. These aren’t detected as links by the filter, but their marks can still click on them as normal! This has been going on for quite a while with different iterations of TLDs and filters, but somehow Google hasn’t managed to stamp it out quite yet.

Here’s an example (Spanish language) spam comment, where the .uno domain is the bait:

[Image: Spanish spam comment by a Cyrillic account]
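
The mismatch between the two detectors can be sketched with a pair of toy regexes. Everything here is an assumption for illustration: neither pattern is YouTube's real filter or linkifier, and the TLD list and the sample comment are made up.

```python
import re

# Hypothetical spam filter: only recognises explicit schemes, "www." or a
# short list of well-known TLDs. This is NOT YouTube's real rule set.
FILTER_LINK = re.compile(
    r"(https?://|www\.)\S+|\b[\w-]+\.(com|net|org|ru)\b",
    re.IGNORECASE,
)

# Hypothetical frontend linkifier: treats any "word.tld" token as a link,
# whatever the TLD happens to be.
LINKIFY = re.compile(r"\b(?:[\w-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def filter_sees_link(comment: str) -> bool:
    """True if the (toy) spam filter thinks the comment contains a link."""
    return FILTER_LINK.search(comment) is not None


def clickable_links(comment: str) -> list[str]:
    """Everything the (toy) frontend would turn into a clickable link."""
    return [m.group(0) for m in LINKIFY.finditer(comment)]


# Made-up comment using an unusual TLD, in the spirit of the .uno example.
comment = "Great video, more like this at example-site.uno every day"
print(filter_sees_link(comment))  # the filter misses it
print(clickable_links(comment))   # the frontend still links it
```

Any comment where the second function finds something and the first returns False is exactly the linking non-link: invisible to the filter, clickable for the reader.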

I often wonder if there’s any real development effort put into spam filtering from the comment side; my guess is that as a centralised platform YouTube puts a lot more emphasis on filtering out spammy accounts. Some of my creator friends often complain about the lack of moderation tools to keep their comment sections clean - beyond deleting comments individually and the “naughty words list” that automatically hides comments there’s not really that much a creator can do.

What are you really doing when you fill in an hCaptcha

8 January 2022 at 01:00

hCaptcha is a reCAPTCHA clone that has been growing in popularity over 2020 and 2021, in particular due to Cloudflare’s conversion of their nag screens from Google’s reCAPTCHA to hCaptcha. Although hCaptcha advertises itself as being a privacy-conscious alternative to reCAPTCHA, there’s also an incentive for websites to switch over: hCaptcha will pay websites each time one of their users completes an hCaptcha challenge.

Now the question is: how does you completing a captcha earn anyone money? Of course, hCaptcha is a VC-funded business, so it can afford to burn money in the pursuit of market share; nonetheless there needs to be a plausible business model there, and it’s not obvious at first sight.

If you read the hCaptcha website, they suggest that AI startups will pay them to label their images for them.¹ Labelling images is a labour-intensive task and required for some current-generation machine learning approaches. AI startups are well-funded and have money to spend on labelling, so this sounds like a reasonable case of selling shovels during a gold rush. But the output from solving CAPTCHAs isn’t obviously isomorphic to the type of labelling required for machine learning, which is often quite specific and requires a very low error rate.

Complex CAPTCHA challenges are not possible, as web users turn out to be drunk, blind, 3 years old, or just randomly clicking buttons to get this infernal thing to go away. Accordingly, hCaptcha challenges are simple: select the images that match a simple 1-3 word prompt from a 3x3 grid. This is fortunately easy for most real people.² ³

The most common prompts seem to be selecting buses, trucks, boats or trains out of the grid.⁴ The market demand for this sort of simple labelling must be rather limited, even if challenges have to be repeated many times and cross-checked to get an acceptable error rate.
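
As a rough idea of what "repeated many times and cross-checked" could mean in practice, here is a minimal majority-vote sketch. It is not hCaptcha's actual pipeline: the function name, the 80% agreement threshold and the response data are all invented for illustration.

```python
from collections import Counter


def consensus_label(answers: list[bool], min_agreement: float = 0.8):
    """Return the majority answer for one image, or None if agreement is too low."""
    if not answers:
        return None
    label, votes = Counter(answers).most_common(1)[0]
    return label if votes / len(answers) >= min_agreement else None


# Made-up responses: True means "the user clicked this image for the prompt 'bus'".
responses = {
    "img_001": [True, True, True, True, False],    # strong agreement
    "img_002": [True, False, True, False, False],  # split, needs more votes
    "img_003": [False] * 5,                        # unanimous "not a bus"
}

labels = {image: consensus_label(votes) for image, votes in responses.items()}
print(labels)
```

An image only gets a label once enough users agree, so every usable label costs several solved challenges.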

So far, a little inscrutable but all seems sensible enough. But then it all gets interesting when you actually take a look at the images in a little more detail:

[Image: hCaptcha example]

Starting from the top left and going right, we have:

A boat that appears to have been painted by Dalí, with a mast drooping like a wet noodle.
A plane with tricycle landing gear, except it’s got two sets of wheels at the front and one at the back. That’s not normal!
A normal-looking plane with some odd-looking clouds above.
A bus with an axle in front of the door, and another behind it, and another at the back. Hmm.
A boat in a marina made of splodges.
A normal-looking boat on a normal-looking sea, except - look at that horizon! How did that happen?
A single-decker London bus with a ghost of its double-decker cousin above. And a giant moth perched on it at the back.
Another ghostly upper deck on a regional bus.
A sailing boat with some oddly stylised “alien” writing on the sail.

These images are obviously AI-generated. They have all the hallmarks of GAN output, with typical artifacts and oddities. Have some more and see if you can spot the same things in these other challenges - it’s not hard at all, is it!

The question then is why? Why would hCaptcha be generating these challenges - aren’t they supposed to be labelling real life, not some AI mirages? You know the labels before you generate them, what’s the point in using humans to re-label them again… And why are the results so bad - these are definitely not state of the art!

The only explanation that makes sense is that hCaptcha is not really doing this whole AI-labelling business at all, or if they are it’s only in a very limited fashion. Most of the time they’re just using a GAN to generate images that defeat the bots’ image recognition AI. And the GAN isn’t trained to optimise human recognition, but rather to confound the bots in an arms race, which leads to the bad image quality.

If you have any better ideas I’d be glad to hear them because this whole thing doesn’t really make much sense.

Footnotes:

¹ If you look closer, they have an article that purports to explain the “technical architecture of hCaptcha” which is a supreme example of buzzword-stuffing blockchain-washed nothing. There is less than zero need for a blockchain to track customer requests, much less the public Ethereum blockchain, but it’s the buzzword of the month so it must go in. ↩
² Most real users, that is. There are some users for whom the challenge is actually too hard, or who’ve been blackholed and are interpreting bad IP reputation as poor skill. But the ones who fall down most often are those who try too hard and analyse the prompt and challenge in too much detail. The real way to solve these image challenges is to answer what you think other people will answer, rather than the correct answer. And don’t take too long either, just a quick glance is all your competition are giving! Anecdotally, this isn’t too common with hCaptcha, but reCAPTCHA challenges are extremely prone to this failure if you think too hard. ↩
³ Unfortunately this is also quite easy for bots, somewhat subverting the point of a CAPTCHA, so that’s how browser fingerprinting and IP reputation creep in to get reasonable enough results. ↩
⁴ These prompts are so common that a front-page post on Hacker News consisted of this observation (and prompted me to write up my thoughts on the topic from the past few months). ↩

Searching for Nothing, Finding a Surprise

23 December 2021 at 21:30

Following on from my post yesterday about an edge case in YouTube, I thought I’d write about a class of edge cases perhaps even more strange that I’ve been exploring recently:

Search engines are a fact of daily life for most of the population nowadays. Google (sub your preferred provider) is an extension of the brain, imagined as giving you access to the sum of the world’s information at the click of a button. But a search engine isn’t just a Ctrl-F for the internet with a nice interface and ads; rather it’s a tremendously complicated system with lots of features and interactions between those features. And all you need to explore the system yourself is some well-tuned search queries.

I recently had an epiphany: search engines are designed to find you results for something and that’s a job they perform well. But there’s nothing stopping you from searching for nothing! And the search engines will still give you results!

And what results they are - have a go on the links below:

An empty query on DDG: https://duckduckgo.com/?q=+""
A different empty query on DDG: https://duckduckgo.com/?q=("")
An empty query on Google: https://www.google.com/search?q=("")
An empty query on Google News: https://www.google.com/search?q=""&tbm=nws

And have you ever thought about doing an anything but search? Normally you can add negations to the end of your search term to remove unwanted results, but there’s nothing stopping you from having a search term consisting entirely of negations!

Here’s one on DDG: https://duckduckgo.com/?q=-"an entirely negated query"
On Bing: https://www.bing.com/search?q=-"an entirely negated query"
And on Google Books: https://www.google.com/search?q=-"nothing to see here"&tbm=bks
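
If you'd rather generate these links than type them, a small sketch like the following should do. It assumes the quotes in the queries are plain ASCII double quotes, and `search_url` is just a helper name made up here; the base URLs and the `q`/`tbm` parameters are the ones from the links above.

```python
from urllib.parse import urlencode


def search_url(base: str, query: str, **params: str) -> str:
    """Build a search URL with the query percent-encoded."""
    return f"{base}?{urlencode({'q': query, **params})}"


empty = '""'                              # a quoted empty phrase
negated = '-"an entirely negated query"'  # nothing but a negation

print(search_url("https://duckduckgo.com/", empty))
print(search_url("https://www.google.com/search", empty, tbm="nws"))
print(search_url("https://www.bing.com/search", negated))
```

The encoded form (quotes as %22, spaces as +) survives copy-pasting better than the raw version, where blog software tends to curl the quotes.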

Commentary

Google appears to have some half-effective filtering for these empty search queries so you’ll mostly get the same two YouTube videos as a result - is this an Easter egg? Google News and Books don’t seem to have any filter, though, and you do get some odd results there!

DuckDuckGo doesn’t appear to have any filtering at all, although it’s obvious just how much DDG relies on Bing’s whitelabel product for its results by looking at how similar the two are.

If you can think of a deeper reason for these results, please do leave a comment and let’s try and explain some of the mystery away.

Drinking From the (Musical) Firehose on YouTube

22 December 2021 at 23:30

Nowadays YouTube is a great place to listen to music, because everything is there. There’s such a wide selection to listen to - seriously - the permissive ask-for-forgiveness¹ bazaar means that if you search for it, it’ll be there. Make your own playlist, and when it’s time to add something new to you, it’ll be there. Alternatively, just be guided by the flow and don’t worry about where it’s all coming from.

And to that point, discovery is where YouTube really excels - The Algorithm knows what genres you like, and what you’ve listened to before, and there’ll always be an old favourite ready to listen to again or something new, but familiar, to experience for the first time. Training time is minimal, because The Algorithm is a simple beast really (do you really think AlphaGooYou is going to waste resources on a complex model?).

That said, sometimes you just want a change, and it’s hard to switch off completely. If you log out and clear your cookies, you’ll get music, sure; but it’ll be the worst dregs of contemporary nongenre, optimised for the dying radio sector. Not worth it! What you need is a quick way to jump out of your filter bubble: a random mode, a shuffle play, so to say. And floating there in the aether, an odd edge case at the margins of the beast, it actually exists:

Here it is, the snappily named “Uploads from Various Artists - Topic” playlist. 20000 entries, all songs just recently uploaded to YouTube in the past week or so. Go ahead: break into a brand new song with 0 lifetime views! Enjoy a random Cyrillic-lettered song you can’t understand! Use it as an infinite radio - whole new songs being added faster than you can listen to them!

Although I don’t completely understand why this exists, it seems to be a quirk in the YouTube partner music upload programme: music rightsholders (or those who purport to be) can upload music to YouTube² in bulk and these are arranged into “Topic Channels” for each artist. These “Channels” inhabit the half-space between a real channel and a playlist - you can subscribe but there’s no real person on the other side of the curtain; certainly there’s no community there. And it seems, on one end or the other, that in the absence of any better information everything just gets unceremoniously dumped into the “Uploads from Various Artists - Topic” topic channel playlist.

Either way, it may be a quirk, and an odd one at that; but it’s fun and it should be saved. Please don’t take it away, oh wondrous BigTech…

Footnotes

¹ For all the perils of YouTube’s arbitrary copyright system, the variety of music it allows is certainly a benefit. When videos are allowed by default, and the normal punishment after detection of your copyright infringement is a few cents from ads going to the labels, you get channels like ultradiskopanorama uploading rare classics that were never going to go on a service like Spotify. ↩
² These videos always have “Auto-generated by YouTube” in the description, and all have their comments turned off (sadly a recent change). ↩
