LED-based festive decorations are a fascinating window into the ingenuity of low-cost electronics. New products appear every year, and often surprising technical approaches are used to achieve some differentiation at minimal added cost.
This year there wasn’t any fancy new controller, but I was surprised how much the cost of simple light strings has come down. The LED string above includes a small box with batteries and came in a set of ten for less than $2 shipped, so under $0.20 each. While I may have benefited from promotional pricing, it is also clear that a great deal of work went into making the product this cheap.
The string is constructed in the same way as one I had analyzed earlier: it uses phosphor-converted blue LEDs that are soldered to two insulated wires and covered with an epoxy blob. In contrast to the earlier device, they seem to have switched from copper wire to cheaper steel wires.
The interesting part is in the control box. It contains three button cells, a small PCB, and a tactile button that turns the string on and cycles through different modes of flashing and constant light.
Curiously, there is nothing on the PCB except the button and a device that looks like an LED. Also, note how some “redundant” joints have simply been left unsoldered.
Closer inspection reveals that the “LED” is actually a very small integrated circuit packaged in an LED package. The four pins are connected to the push button, the cathode of the LED string, and the power supply pins. I didn’t measure the die size exactly, but I estimate that it is smaller than 0.3 mm × 0.2 mm, i.e. below ~0.06 mm².
What is the purpose of packaging an IC in an LED package? Most likely, the company that made the light string is also packaging their own LEDs, and they saved costs by also packaging the IC themselves—in a package type they had available.
I characterized the current-voltage behavior of the IC’s supply pins with the LED string connected. The LED string started to emit light at around 2.7 V, which is consistent with the forward voltage of blue LEDs. Beyond that point, the current increased proportionally to the voltage, which suggests that there is no current limiter or constant-current sink in the IC – it’s simply a switch with some series resistance.
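In other words, above the LED forward voltage the device behaves like a closed switch in series with a resistor, so the current follows I ≈ (V − V_f)/R_s. Purely as a hypothetical example: if the current rose by 10 mA for every additional volt, the effective series resistance R_s would be around 100 Ω.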
Left: LED string in “constantly on” mode. Right: Flashing
Using an oscilloscope, I found that the string is modulated with an on-off ratio of 3:1 at a frequency of ~1.2 kHz. The image above shows the voltage at the cathode; the anode is connected to the positive supply. The modulation is most likely there to limit the LED current.
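Assuming the 3:1 figure means three parts on-time to one part off-time, the duty cycle works out to 3/(3+1) = 75%, so the average LED current is about three quarters of the peak current, a reduction achieved without any additional components.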
All in all, it is rather surprising to see an ASIC being used when it barely does more than flash the LED string. It would have been nice to see a constant-current source to stabilize the light level over the lifetime of the battery, and maybe more interesting light effects. But I guess that would have increased the cost of the ASIC too much, and then an ultra-low-cost microcontroller might have been cheaper. This almost calls for a transplant of an MCU into this device…
Buoyed by the surprisingly good performance of neural networks with quantization-aware training on the CH32V003, I wondered how far this can be pushed. How much can we compress a neural network while still achieving good test accuracy on the MNIST dataset? When it comes to the absolute low end of microcontrollers, there is hardly a more compelling target than the Padauk 8-bit microcontrollers. These are microcontrollers optimized for the simplest and lowest-cost applications there are. The smallest device of the portfolio, the PMS150C, sports 1024 words of 13-bit one-time-programmable memory and 64 bytes of RAM, more than an order of magnitude less than the CH32V003. In addition, it has a proprietary accumulator-based 8-bit architecture, as opposed to the much more powerful RISC-V instruction set.
Is it possible to implement an MNIST inference engine that can classify handwritten digits on a PMS150C as well?
On the CH32V003, I used MNIST samples that were downscaled from 28×28 to 16×16, so that every sample takes 256 bytes of storage. This is quite acceptable when 16 KB of flash is available, but with only 1 kword of ROM, it is too much. Therefore, I started by downscaling the dataset to 8×8 pixels.
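The resampling itself happens offline in the training pipeline, but to illustrate the idea: going from 16×16 down to 8×8 can be done with simple 2×2 average pooling. Here is a minimal C sketch (the function name and data layout are mine, not taken from the actual tooling):

#include <stdint.h>

// Downscale a 16x16 grayscale image to 8x8 by averaging each 2x2 block.
// Illustrative only: the real dataset preparation runs offline during
// training, not on the microcontroller.
void downscale_16_to_8(const uint8_t in[16][16], uint8_t out[8][8])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            uint16_t acc = in[2*y][2*x]     + in[2*y][2*x + 1]
                         + in[2*y + 1][2*x] + in[2*y + 1][2*x + 1];
            out[y][x] = (uint8_t)(acc / 4);  // truncating average of the block
        }
    }
}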
The image above shows a few samples from the dataset at both resolutions. At 16×16 it is still easy to discriminate different numbers. At 8×8 it is still possible to guess most numbers, but a lot of information is lost.
Surprisingly, it is still possible to train a machine learning model to recognize even these very low-resolution numbers with impressive accuracy. It’s important to remember that the test dataset contains 10,000 images that the model does not see during training. The only way for a very small model to recognize these images accurately is to identify common patterns; the model capacity is too limited to “remember” complete digits. I trained a number of different network configurations to understand the trade-off between network memory footprint and achievable accuracy.
Parameter Exploration
The plot above shows the results of my hyperparameter exploration experiments, comparing models with different numbers of weights and quantization levels from 1 to 4 bits, for input images of 8×8 and 16×16 pixels. The smallest models had to be trained without data augmentation, as they would not converge otherwise.
Again, there is a clear relationship between test accuracy and the memory footprint of the network. Increasing the memory footprint improves accuracy up to a certain point. For 16×16, around 99% accuracy can be achieved at the upper end, while around 98.5% is achieved for 8×8 test samples. This is still quite impressive, considering the significant loss of information for 8×8.
For small models, 8×8 achieves better accuracy than 16×16. The reason for this is that the size of the first layer dominates in small models, and this size is reduced by a factor of 4 for 8×8 inputs.
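To put rough numbers on this, take the 2-bit weights and the 16-channel-wide layers of the final model as an example: a 16×16 input requires 256 × 16 × 2 bits = 8192 bits (1 KB) for the first layer alone, while an 8×8 input requires only 64 × 16 × 2 bits = 2048 bits (256 bytes).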
Surprisingly, it is possible to achieve over 90% test accuracy even on models as small as half a kilobyte. This means that it would fit into the code memory of the microcontroller! Now that the general feasibility has been established, I needed to tweak things further to accommodate the limitations of the MCU.
Training the Target Model
Since the RAM is limited to 64 bytes, the model structure had to keep the number of intermediate activations during inference to a minimum. I found that it was possible to use layers as narrow as 16. This reduces the buffer size during inference to only 32 bytes: 16 bytes each for one input buffer and one output buffer, leaving 32 bytes for other variables. The 8×8 input pattern is read directly from the ROM.
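Conceptually, inference just ping-pongs between the two 16-byte activation buffers. The sketch below illustrates the scheme; the names and the stub layer function are mine and stand in for the real fixed-point code:

#include <stdint.h>

// Two 16-byte activation buffers are alternated between layers, so peak
// RAM usage stays at 32 bytes regardless of how many layers are evaluated.
static uint8_t buf_a[16], buf_b[16];

// Stand-in for the real fully connected layer with 2-bit weights.
static void fc_layer(const uint8_t *in, uint8_t n_in, uint8_t *out, uint8_t n_out)
{
    for (uint8_t o = 0; o < n_out; o++) {
        uint16_t sum = 0;
        for (uint8_t i = 0; i < n_in; i++)
            sum += in[i];             // the real code applies the decoded weights here
        out[o] = (uint8_t)(sum >> 3); // constant-shift normalization, as in the real code
    }
}

void infer(const uint8_t input[64])   // 8x8 image, read directly from ROM
{
    fc_layer(input, 64, buf_a, 16);   // first layer: ROM -> buf_a
    fc_layer(buf_a, 16, buf_b, 16);   // hidden layer: buf_a -> buf_b
    fc_layer(buf_b, 16, buf_a, 16);   // hidden layer: buf_b -> buf_a
    // ...the final layer would produce the 10 class scores
}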
Furthermore, I used 2-bit weights with irregular spacing of (-2, -1, 1, 2) to allow for a simplified implementation of the inference code. I also skipped layer normalization and instead used a constant shift to rescale activations. These changes slightly reduced accuracy. The resulting model structure is shown below.
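Reading the assembly loop further below, the two bits of each weight map onto the set (-2, -1, 1, 2) as follows: one bit doubles the activation and the other negates it. A plain-C rendering of that mapping (my own reformulation, not the shipped code):

#include <stdint.h>

// Decode one 2-bit weight into {-2, -1, 1, 2}. Note that zero is not
// representable, which is what makes the spacing "irregular". The bit
// assignment is inferred from the assembly loop shown further below.
int8_t decode_weight(uint8_t w)
{
    int8_t v = (w & 1) ? 2 : 1;       // low bit selects the magnitude
    return (w & 2) ? (int8_t)-v : v;  // high bit selects the sign
}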
All things considered, I ended up with a model with 90.07% accuracy and a total of 3392 bits (0.414 kilobytes) in 1696 weights, as shown in the log below. The panel on the right displays the first layer weights of the trained model, which directly mask features in the test images. In contrast to the higher accuracy models, each channel seems to combine many features at once, and no discernible patterns can be seen.
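As a sanity check on those numbers: 1696 weights × 2 bits = 3392 bits. A fully connected stack of 64→16→16→16→10 without biases would account for exactly 64·16 + 16·16 + 16·16 + 16·10 = 1696 weights, consistent with the 16-wide layers discussed above (I am reconstructing this from the arithmetic, not restating the published structure).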
Implementation on the Microcontroller
In the first iteration, I used a slightly larger variant of the Padauk microcontrollers, the PFS154. This device has twice the ROM and RAM and can be reflashed, which simplifies software development tremendously. The C version of the inference code, including the debug output, worked almost out of the box. Below, you can see the predictions and labels, including the last-layer output.
Squeezing everything down to fit into the smaller PMS150C was a different matter. One major issue when programming these devices in C is that every function call consumes RAM for the return stack and function parameters. This is unavoidable because the architecture has only a single register (the accumulator), so all other operations must occur in RAM.
To solve this, I flattened the inference code and implemented the inner loop in assembly to optimize variable usage. The inner loop for memory-to-memory inference of one layer is shown below. The two-bit weight is multiplied with a four-bit activation in the accumulator and then added to a 16-bit register. The multiplication requires only four instructions (t0sn, sl, t0sn, neg), thanks to the powerful bit-manipulation instructions of the architecture. The sign-extending addition (add, addc, sl, subc) also consists of four instructions, demonstrating the limitations of 8-bit architectures.
// sum, weightChunk, weightidx and output are file-scope variables so that
// the inline assembly can reference them as _sum, _weightChunk, etc.
void fc_innerloop_mem(uint8_t loops) {
    sum = 0;
    do {
        weightChunk = *weightidx++;
        __asm
            idxm  a, _activations_idx   ; load the next activation into the accumulator
            inc   _activations_idx+0    ; advance the activation pointer
            t0sn  _weightChunk, #6
            sl    a                     ; if (weightChunk & 0x40) in = in + in;
            t0sn  _weightChunk, #7
            neg   a                     ; if (weightChunk & 0x80) in = -in;
            add   _sum+0, a             ; 16-bit accumulate with sign extension:
            addc  _sum+1                ; add low byte, propagate carry,
            sl    a                     ; shift the sign bit of a into carry,
            subc  _sum+1                ; and correct the high byte for negative a
            ; ... 3x more ...
        __endasm;
    } while (--loops);

    int8_t sum8 = ((uint16_t)sum) >> 3;  // Normalization (constant shift instead of layer norm)
    sum8 = sum8 < 0 ? 0 : sum8;          // ReLU
    *output++ = sum8;
}
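For reference, this is what the four-instruction multiply and the sign-extending addition compute, written out in portable C (a semantic equivalent for illustration; it would not fit the PMS150C in this form):

#include <stdint.h>

// Semantic equivalent of one weight/activation step of the assembly loop:
// bit 6 of the weight chunk doubles the activation (sl), bit 7 negates it
// (neg), and the signed 8-bit result is sign-extended into the 16-bit sum
// (add/addc followed by sl/subc).
static void mac_step(int16_t *sum, uint8_t weightChunk, uint8_t activation)
{
    int8_t a = (int8_t)activation;     // 4-bit activation in the accumulator
    if (weightChunk & 0x40) a += a;    // t0sn #6 / sl a
    if (weightChunk & 0x80) a = -a;    // t0sn #7 / neg a
    *sum += a;                         // sign-extending 16-bit accumulate
}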
In the end, I managed to fit the entire inference code into 1 kword of program memory and reduced SRAM usage to 59 bytes, as seen below. (Note that the output from SDCC assumes 2 bytes per instruction word, while the actual instruction width is only 13 bits.)
Success! Unfortunately, there was no ROM space left for the soft UART to output debug information. However, based on the verification on the PFS154, I trust that the code works, and since I don’t have any specific application in mind, I left it at that stage.
Summary
It is indeed possible to implement MNIST inference with good accuracy using one of the cheapest and simplest microcontrollers on the market. A lot of memory footprint and processing overhead is usually spent on implementing flexible inference engines that can accommodate a wide range of operators and model structures. Cutting away this overhead and reducing the functionality to its core allows for astonishing simplification at this very low end.
This hack demonstrates that there truly is no fundamental lower limit to applying machine learning and edge inference. However, the feasibility of implementing useful applications at this level is somewhat doubtful.
The CH32V203 is a 32-bit RISC-V microcontroller. In the product portfolio of WCH, it is the next step up from the CH32V003, sporting a much higher clock rate of 144 MHz and a more powerful RISC-V core with the RV32IMAC instruction set architecture. The CH32V203 is also extremely affordable, starting at around 0.40 USD in quantities above 100, depending on the configuration.
An interesting remark on Twitter piqued my interest: supposedly, the listed flash memory size refers only to the fraction that can be accessed with zero wait states, while the total flash size is actually 224 KB. The datasheet indeed has a footnote claiming the same. In addition, the RB variant offers the option to reconfigure memory between RAM and flash, which is rather odd, considering that writing to flash is usually much slower than writing to RAM.
The 224 KB figure also appears in the memory map. Besides the code flash, there is a 28 KB boot section and an additional 4 KB of configurable space. 224 KB + 28 KB + 4 KB = 256 KB, which suggests that the total available flash is 256 KB and that it is remapped to different locations in the memory map.
All of these are telltale signs of an architecture where a separate NOR flash die stores the code and the main CPU die has a small SRAM that is used as a cache. This configuration was pioneered by GigaDevice and is more recently also famously used by the ESP32 and the RP2040, although the latter two use an external NOR flash device.
Flash memory is quite different from normal CMOS logic, as it requires a special gate stack, isolation, and much higher voltages. Therefore, integrating flash memory into a CMOS logic die usually requires extra process steps, and the added complexity increases when moving to smaller technology nodes. Separating the two dies offers the option of using a high-density logic process (for example, 45 nm) and pairing it with a low-cost off-the-shelf NOR flash die.
Decapsulation and Die Images
To confirm my suspicions, I decapsulated a CH32V203C8T6 sample, shown above. I heated the package to drive out the resin and then carefully broke the now-brittle package apart. Already after removing the lead frame, we can clearly see that it contains two dies.
The small die is around 0.5 mm² in area. I wasn’t able to completely remove the remaining filler, but we can see that it is an IC with a small number of pads, consistent with a serial flash die.
The microcontroller die came out really well. Unfortunately, the photos below are severely limited by my low-cost USB microscope. I hope Zeptobars or others will come up with nicer images at some point.
The die size of ~1.8 mm² is surprisingly small. In fact, it is even smaller than the ~2.0 mm² die of the CH32V003, according to Zeptobars’ die shot. Apart from the flash being moved off-chip, the CH32V203 was most likely also fabricated in a much smaller CMOS technology node than the V003.
Summary
It was quite surprising to find a two-die configuration in such a low-cost device. But obviously, it explains the oddities in the device specification, and it also explains how a 144 MHz core clock is possible in this device without wait states.
What are the repercussions?
Amazingly, it seems that instead of only 32 KB of flash, as listed for the smallest device, a total of 224 KB can be used for code and data storage. The datasheet mentions a special “flash enhanced read mode” that can apparently be used to execute code from the extended flash space. It’s not entirely clear what the impact on speed is, but that’s certainly an area for exploration.
I also expect this MCU to be highly overclockable, similar to the RP2040.
hCaptcha is a reCAPTCHA clone that has been growing in popularity over 2020 and 2021, in particular due to Cloudflare’s conversion of their nag screens from Google’s reCAPTCHA to hCaptcha. Although hCaptcha advertises itself as being a privacy-conscious alternative to reCAPTCHA, there’s also an incentive for websites to switch over: hCaptcha will pay websites each time one of their users completes a hCaptcha challenge.
Now the question is: how does your completing a captcha earn anyone money? Of course, hCaptcha is a VC-funded business, so it can afford to burn money in the pursuit of market share; nonetheless, there needs to be a plausible business model there, and it’s not obvious at first sight.
If you read the hCaptcha website, they suggest that AI startups will pay them to label their images for them.[1] Labelling images is a labour-intensive task and is required for some current-generation machine learning approaches. AI startups are well-funded and have money to spend on labelling, so this sounds like a reasonable case of selling shovels during a gold rush. But the output from solving CAPTCHAs isn’t obviously isomorphic to the type of labelling required for machine learning, which is often quite specific and requires a very low error rate.
Complex CAPTCHA challenges are not possible, as web users turn out to be drunk, blind, 3 years old, or just randomly clicking buttons to get this infernal thing to go away. Accordingly, hCaptcha challenges are simple: select the images that match a simple 1-3 word prompt from a 3×3 grid. This is fortunately easy for most real people.[2][3]
The most common prompts seem to be selecting buses, trucks, boats or trains out of the grid.[4] The market demand for this sort of simple labelling must be rather limited, even if challenges have to be repeated many times and cross-checked to get an acceptable error rate.
So far, a little inscrutable but all seems sensible enough. But then it all gets interesting when you actually take a look at the images in a little more detail:
Starting from the top left and going right, we have:
A boat that appears to have been painted by Dalí, with a mast drooping like a wet noodle.
A plane with tricycle landing gear, except it’s got two sets of wheels at the front and one at the back. That’s not normal!
A normal looking plane with some odd-looking clouds above.
A bus with an axle in front of the door, another behind it, and another at the back. Hmm.
A boat in a marina made of splodges.
A normal-looking boat on a normal-looking sea, except... look at that horizon! How did that happen?
A single-decker London bus with a ghost of its double-decker cousin above. And a giant moth perched on it at the back.
Another ghostly upper deck on a regional bus.
A sailing boat with some oddly stylised “alien” writing on the sail.
These images are obviously AI-generated. They have all the hallmarks of GAN output, with typical artifacts and oddities. Have some more and see if you can spot the same things in these other challenges. It’s not hard at all, is it?
The question then is: why? Why would hCaptcha be generating these challenges? Aren’t they supposed to be labelling real life, not some AI mirages? You know the labels before you generate the images; what’s the point in using humans to re-label them again? And why are the results so bad? These are definitely not state of the art!
The only explanation that makes sense is that hCaptcha is not really doing this whole AI-labelling business at all, or if they are, it’s only in a very limited fashion. Most of the time, they’re just using a GAN to generate images that defeat the bots’ image-recognition AI. And the GAN isn’t trained to optimise for human recognition, but rather to confound the bots in an arms race, which leads to the poor image quality.
If you have any better ideas I’d be glad to hear them because this whole thing doesn’t really make much sense.
Footnotes:
[1] If you look closer, they have an article that purports to explain the “technical architecture of hCaptcha” which is a supreme example of buzzword-stuffing, blockchain-washed nothing. There is less than zero need for a blockchain to track customer requests, much less the public Ethereum blockchain, but it’s the buzzword of the month so it must go in. ↩
[2] Most real users, that is. There are some users for whom the challenge is actually too hard, or who’ve been blackholed and are interpreting bad IP reputation as poor skill. But the ones who fall down most often are those who try too hard and analyse the prompt and challenge in too much detail. The real way to solve these image challenges is to answer what you think other people will answer, rather than the correct answer. And don’t take too long either, just a quick glance is all your competition are giving! Anecdotally, this isn’t too common with hCaptcha, but reCAPTCHA challenges are extremely prone to this failure if you think too hard. ↩
[3] Unfortunately this is also quite easy for bots, somewhat subverting the point of a CAPTCHA, so that’s how browser fingerprinting and IP reputation creep in to get reasonable enough results. ↩
[4] These prompts are so common that a front-page post on Hacker News consisted of this observation (and prompted me to write up my thoughts on the topic from the past few months). ↩
Search engines are a fact of daily life for most of the population nowadays. Google (substitute your preferred provider) is an extension of the brain, imagined as giving you access to the sum of the world’s information at the click of a button. But a search engine isn’t just a Ctrl-F for the internet with a nice interface and ads; rather, it’s a tremendously complicated system with many features and interactions between those features. And all you need to explore the system yourself is some well-tuned search queries.
I recently had an epiphany: search engines are designed to find you results for something and that’s a job they perform well. But there’s nothing stopping you from searching for nothing! And the search engines will still give you results!
And what results they are! Have a go on the links below:
And have you ever thought about doing an “anything but” search? Normally, you can add negations to the end of your search term to remove unwanted results, but there’s nothing stopping you from having a search term consisting entirely of negations!
Google appears to have some half-effective filtering for these empty search queries, so you’ll mostly get the same two YouTube videos as a result. Is this an Easter egg? Google News and Books, however, don’t have any filter, and you do get some odd results there!
DuckDuckGo doesn’t appear to have any filtering at all, although looking at how similar the two are makes it obvious just how much DDG relies on Bing’s white-label product for its results.
If you can think of a deeper reason for these results, please do leave a comment and let’s try to explain some of the mystery away.