Neural Networks (MNIST inference) on the “3-cent” Microcontroller

By: cpldcpu

Buoyed by the surprisingly good performance of neural networks with quantization-aware training on the CH32V003, I wondered how far this can be pushed. How much can we compress a neural network while still achieving good test accuracy on the MNIST dataset? When it comes to absolutely low-end microcontrollers, there is hardly a more compelling target than the Padauk 8-bit microcontrollers. These are microcontrollers optimized for the simplest and lowest-cost applications there are. The smallest device of the portfolio, the PMS150C, sports 1024 13-bit words of one-time-programmable memory and 64 bytes of RAM, more than an order of magnitude less than the CH32V003. In addition, it has a proprietary accumulator-based 8-bit architecture, as opposed to the much more powerful RISC-V instruction set.

Is it possible to implement an MNIST inference engine, one that can classify handwritten digits, on a PMS150C as well?

On the CH32V003 I used MNIST samples that were downscaled from 28×28 to 16×16, so that every sample takes 256 bytes of storage. This is quite acceptable if there are 16 KB of flash available, but with only 1 kword of ROM, it is too much. Therefore, I started by downscaling the dataset to 8×8 pixels.
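
The downscaling itself happens offline when preparing the dataset, not on the microcontroller. As a minimal sketch of the idea, assuming plain 2×2 block averaging (the actual preprocessing may well use a different interpolation filter), reducing a 16×16 image to 8×8 could look like this:

#include <stdint.h>

/* Illustrative only: shrink a 16x16 grayscale image to 8x8 by averaging
   each 2x2 block of pixels. */
void downscale_16x16_to_8x8(const uint8_t in[16][16], uint8_t out[8][8])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            uint16_t acc = in[2*y][2*x]     + in[2*y][2*x+1]
                         + in[2*y+1][2*x]   + in[2*y+1][2*x+1];
            out[y][x] = (uint8_t)(acc / 4);  /* average of the block */
        }
    }
}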

The image above shows a few samples from the dataset at both resolutions. At 16×16 it is still easy to discriminate different numbers. At 8×8 it is still possible to guess most numbers, but a lot of information is lost.

Surprisingly, it is still possible to train a machine learning model to recognize even these very low resolution numbers with impressive accuracy. It’s important to remember that the test dataset contains 10,000 images that the model does not see during training. The only way for a very small model to recognize these images accurately is to identify common patterns; the model capacity is too limited to “remember” complete digits. I trained a number of different network configurations to understand the trade-off between network memory footprint and achievable accuracy.

Parameter Exploration

The plot above shows the result of my hyperparameter exploration experiments, comparing models with different configurations of weights and quantization levels from 1 to 4 bits for input images of 8×8 and 16×16. The smallest models had to be trained without data augmentation, as they would not converge otherwise.

Again, there is a clear relationship between test accuracy and the memory footprint of the network. Increasing the memory footprint improves accuracy up to a certain point. For 16×16, around 99% accuracy can be achieved at the upper end, while around 98.5% is achieved for 8×8 test samples. This is still quite impressive, considering the significant loss of information for 8×8.

For small models, 8×8 achieves better accuracy than 16×16. The reason for this is that the size of the first layer dominates in small models, and this size is reduced by a factor of 4 for 8×8 inputs.

Surprisingly, it is possible to achieve over 90% test accuracy even on models as small as half a kilobyte. This means that it would fit into the code memory of the microcontroller! Now that the general feasibility has been established, I needed to tweak things further to accommodate the limitations of the MCU.

Training the Target Model

Since the RAM is limited to 64 bytes, the model structure had to keep the number of intermediate activations during inference to a minimum. I found that it was possible to use layers as narrow as 16. This reduces the buffer size during inference to only 32 bytes (16 bytes each for one input buffer and one output buffer), leaving 32 bytes for other variables. The 8×8 input pattern is read directly from the ROM.
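
As a rough sketch of the resulting buffer handling (the names, the layer count, and the compute_layer routine below are placeholders, not code from the repository), the activations can ping-pong between two 16-byte buffers:

#include <stdint.h>

#define NUM_LAYERS 3   /* placeholder, not the actual layer count */

extern const uint8_t input_image[64];   /* 8x8 input pattern stored in ROM */
void compute_layer(uint8_t layer, const uint8_t *in, uint8_t *out); /* hypothetical */

static uint8_t buf_a[16];   /* 16-byte activation buffer */
static uint8_t buf_b[16];   /* second 16-byte buffer, 32 bytes of RAM in total */

void run_network(void)
{
    /* The first layer reads the 64-byte input directly from ROM and
       writes its 16 output activations into buf_a. */
    compute_layer(0, input_image, buf_a);

    uint8_t *in  = buf_a;
    uint8_t *out = buf_b;
    for (uint8_t layer = 1; layer < NUM_LAYERS; layer++) {
        compute_layer(layer, in, out);
        uint8_t *tmp = in; in = out; out = tmp;   /* swap input and output */
    }
}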

Furthermore, I used 2-bit weights with irregular spacing of (-2, -1, 1, 2) to allow for a simplified implementation of the inference code. I also skipped layer normalization and instead used a constant shift to rescale activations. These changes slightly reduced accuracy. The resulting model structure is shown below.
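
A 2-bit weight with this spacing can be decoded with one conditional doubling and one conditional negation, which is my reading of the assembly comments shown further below (the helper here is illustrative, not part of the project):

#include <stdint.h>

/* Decode one 2-bit weight from the top two bits of a packed byte:
   bit 6 selects the magnitude (1 or 2), bit 7 selects the sign. */
static int8_t decode_weight(uint8_t chunk)
{
    int8_t w = 1;
    if (chunk & 0x40) w += w;   /* 1 -> 2 */
    if (chunk & 0x80) w = -w;   /* negate */
    return w;                   /* one of -2, -1, 1, 2 */
}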

All things considered, I ended up with a model with 90.07% accuracy and a total of 3392 bits (0.414 kilobytes) in 1696 weights, as shown in the log below. The panel on the right displays the first layer weights of the trained model, which directly mask features in the test images. In contrast to the higher accuracy models, each channel seems to combine many features at once, and no discernible patterns can be seen.
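
As a quick sanity check on the footprint, this is just the arithmetic spelled out (not code from the project):

#include <stdio.h>

int main(void)
{
    const unsigned weights = 1696;              /* weights in the final model */
    const unsigned bits_per_weight = 2;         /* 2-bit quantization */
    unsigned bits = weights * bits_per_weight;  /* 3392 bits */
    double kbytes = bits / 8.0 / 1024.0;        /* ~0.414 kilobytes */
    printf("%u bits = %.3f kilobytes\n", bits, kbytes);
    return 0;
}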

Implementation on the Microcontroller

In the first iteration, I used a slightly larger variant of the Padauk Microcontrollers, the PFS154. This device has twice the ROM and RAM and can be reflashed, which tremendously simplifies software development. The C versions of the inference code, including the debug output, worked almost out of the box. Below, you can see the predictions and labels, including the last layer output.

Squeezing everything down to fit into the smaller PMS150C was a different matter. One major issue when programming these devices in C is that every function call consumes RAM for the return stack and function parameters. This is unavoidable because the architecture has only a single register (the accumulator), so all other operations must occur in RAM.

To solve this, I flattened the inference code and implemented the inner loop in assembly to optimize variable usage. The inner loop for memory-to-memory inference of one layer is shown below. The two-bit weight is multiplied with a four-bit activation in the accumulator and then added to a 16-bit register. The multiplication requires only four instructions (t0sn, sl, t0sn, neg), thanks to the powerful bit-manipulation instructions of the architecture. The sign-extending addition (add, addc, sl, subc) also consists of four instructions, demonstrating the limitations of 8-bit architectures.

void fc_innerloop_mem(uint8_t loops) {

    sum = 0;
    do  {
        weightChunk = *weightidx++;
__asm
        idxm  a, _activations_idx
        inc   _activations_idx+0

        t0sn  _weightChunk, #6
        sl    a             ;   if (weightChunk & 0x40) in = in+in;
        t0sn  _weightChunk, #7
        neg   a             ;   if (weightChunk & 0x80) in = -in;

        add   _sum+0, a
        addc  _sum+1
        sl    a
        subc  _sum+1

        ... 3x more ...

__endasm;
    } while (--loops);

    int8_t sum8 = ((uint16_t)sum)>>3; // Normalization
    sum8 = sum8 < 0 ? 0 : sum8;       // ReLU
    *output++ = sum8;
}
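
For reference, here is a plain C sketch of the multiply-accumulate that the assembly implements for one packed weight byte. The names and the shift-based decoding are illustrative; the real code unrolls the four weights and keeps everything in global variables to avoid stack usage:

#include <stdint.h>

/* Accumulate four 2-bit weights (packed into one byte) times four 4-bit
   activations into a 16-bit sum. */
static int16_t mac_weight_byte(int16_t sum, uint8_t chunk, const uint8_t *act)
{
    for (uint8_t i = 0; i < 4; i++) {
        int8_t a = (int8_t)act[i];
        if (chunk & 0x40) a += a;   /* magnitude bit: multiply by 2 */
        if (chunk & 0x80) a = -a;   /* sign bit: negate */
        sum += a;                   /* sign-extending 16-bit accumulate */
        chunk <<= 2;                /* move on to the next packed weight */
    }
    return sum;
}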

In the end, I managed to fit the entire inference code into 1 kiloword of memory and reduced SRAM usage to 59 bytes, as seen below. (Note that the output from SDCC assumes 2 bytes per instruction word, while the actual instruction word is only 13 bits wide.)

Success! Unfortunately, there was no ROM space left for the soft UART to output debug information. However, based on the verification on the PFS154, I trust that the code works, and since I don’t have any specific application in mind, I left it at that stage.

Summary

It is indeed possible to implement MNIST inference with good accuracy using one of the cheapest and simplest microcontrollers on the market. A lot of memory footprint and processing overhead is usually spent on implementing flexible inference engines that can accommodate a wide range of operators and model structures. Cutting this overhead away and reducing the functionality to its core allows for astonishing simplification at this very low end.

This hack demonstrates that there truly is no fundamental lower limit to applying machine learning and edge inference. However, the feasibility of implementing useful applications at this level is somewhat doubtful.

You can find the project repository here.

Ultra Low Power LED Flasher using the Padauk PFS154

By: cpldcpu

Flashing an LED is certainly among the first problems any budding electronics specialist tackles, be it with an ancient NE555 or, more recently, a microcontroller to control the LED. As it turns out, we can turn any trivial problem into a harder one by changing its constraints.

So, what about the challenge of flashing an LED from the charge of a single battery for as long as possible? Of course, this is not a novel problem either. Two interesting approaches that I came across in the past: 1) Burkhard Kainka’s Ewiger Blinker (“Eternal Blinky”) and 2) Ted Yapo’s TritiLED.

B. Kainka’s project is an LED flasher circuit made from discrete transistors that consumes about 50 µA and is able to run for years from a single AA cell. Ted Yapo raised the bar a bit further and investigated in intricate detail how to make an LED shine for years at very low intensity from a CR2032 coin cell. His project logs are certainly worth a read. One very interesting detail is that he concluded that using a low-power microcontroller to control the LED is actually the most efficient option. This may be a bit counterintuitive, but appears more obvious when looking at his attempts at building a discrete version.

Many microcontrollers offer highly optimized low-power sleep modes that can be used to wait between the flashes. The microcontroller only needs to be active when the LED is flashed. At that point it does not really matter how high the active power consumption of the microcontroller is, because the LED itself needs several mA to emit light at a sufficient level.

Enter the infamous “3 cent” Padauk microcontroller family that I have used for several projects before. To my surprise, these devices offer very competitive low-power sleep modes that seem to be on par with those of several “low power” 8-bit microcontrollers that cost ten times as much. I investigated how to implement an ultra-low-power LED flasher on the PFS154.

Implementation

The first step in reducing the power consumption of the MCU is to use the low-speed oscillator as the clock source. In the PFS154 this is called the “ILRC” and provides a clock of around 52 kHz, depending on supply voltage. One oddity I found is that it was necessary to activate both the high-speed and the low-speed oscillator in a first step and only disable the high-speed oscillator in a second step. Directly switching to the ILRC halted the MCU. The code example below is based on the free-pdk includes.

/*   Activate low frequency oscillator as main clock. */
CLKMD =  CLKMD_ILRC | CLKMD_ENABLE_ILRC | CLKMD_ENABLE_IHRC; 
CLKMD =  CLKMD_ILRC | CLKMD_ENABLE_ILRC ; 
    // Note: it is important to turn off IHRC only after clock 
    // settings have been updated. Otherwise the CPU stalls.

Running the PFS154 at such a low clock will already reduce the power consumption to far below 100 µA. Not all of this is dynamic power consumption that scales with the clock rate, so the only way to go further is to activate one of the sleep modes.

Sleep modes

The PFS154 supports two sleep modes: “STOPSYS” and “STOPEXE”.

STOPSYS completely halts the core and all oscillators. The only way to wake up from this state is by a pin change.

STOPEXE halts the core, but the low-frequency oscillator remains active and can be used to clock the timers. The core can be woken up by a pin change or a timer event. It appears that, although not clearly stated in the datasheet, both the 8-bit timers and the 16-bit timer can generate wake-up events. Note that the watchdog timer is also halted during STOPEXE. This is in contrast to the behavior of other microcontrollers.

As a first step, I used my multimeter to verify the current consumption vs. supply voltage in the sleep modes, as shown above. I was basically able to reproduce the curves from the datasheet, which confirms both that the datasheet is correct and that my handheld multimeter can actually measure currents as low as a few hundred nanoamperes accurately! Not something I had expected, to be honest.

During this I found one peculiar behavior of the PFS154. The pin-change wakeup is enabled by default after reset, and it appears that even very small voltage changes on the pins can generate a wakeup. If the pins are left floating, it is sufficient to touch one to wake up the core. Interestingly, this even applied to pins that were not routed outside of the package but still existed as pads on the die. By touching the surface of the IC it was possible to generate a wake-up event! Unless you are interested in building a hacky touch sensor, it is therefore advisable to disable all pins as a wakeup source.

Implementation of timer wakeup

Since I want to build an LED flasher, I used Timer2 to generate a wakeup event at a frequency of around 1.6 Hz. You can see the full code for the STOPEXE configuration and timer initialization below.

/* Configure STOPEXE mode and set up Timer 2 as wake up source */

  PADIER = 0; // disable pins as wakeup source
  PBDIER = 0; // Also port B needs to be disabled even if it is 
      // not connected to the outside of the package. 
      // Touching the package can introduce glitches and wake
      // up the device

  INTEN = 0;  // Make sure all interrupts are disabled
  INTRQ = 0;

  MISC = MISC_FAST_WAKEUP_ENABLE;  
      // Enable faster wakeup (45 clocks instead of 3000)
      // This is important to not waste energy, as 40µA bias 
      // is already added during wakeup time

  TM2C  = TM2C_CLK_ILRC | TM2C_MODE_PWM;
      // Oscillator source for timer 2 is the ILRC (53 kHz)
  TM2CT = 0;
  TM2S  = TM2S_PRESCALE_DIV16 | TM2S_SCALE_DIV8;
      // Divide clock by 16*8 = 128 -> 53 kHz / 128 = 414 Hz
  TM2B  = 1;
      // PWM threshold set to 1.
      // The PWM event will trigger the wakeup.
      // Wakeup occurs at 414 Hz / 256 = 1.6 Hz
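
Spelled out as plain numbers, the divider chain gives the intended flash rate (nominal ILRC frequency assumed):

#include <stdio.h>

int main(void)
{
    double f_ilrc  = 53000.0;           /* ILRC, nominally ~53 kHz */
    double f_timer = f_ilrc / (16 * 8); /* prescaler 16, scale 8 -> ~414 Hz */
    double f_wake  = f_timer / 256;     /* 8-bit PWM period -> ~1.6 Hz */
    printf("timer: %.0f Hz, wakeup: %.2f Hz\n", f_timer, f_wake);
    return 0;
}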

One important optimization was to turn on the “fast wakeup mode”. The normal wakeup takes around 3000 clock cycles, during which around 40 µA of current is consumed. I also found that the 8-bit timers can be used as PWM generators during STOPEXE mode. However, it is not possible to prevent them from waking up the CPU, so they cannot be used autonomously.

LED flashing code

The only part that remains is the code to actually flash the LED. This is rather simple as shown below.

/* Initialize LED I/O and flash the LED */

  PA    = 1<<4; // LED is on PA4; set all other outputs to zero.
  PAPH   = 0;   // Disable all pull-up resistors
  PAC    = 0;   // Disable all outputs
    // Note: There is no series resistor for the LED.
    // The LED current is limited by the LOW I/O driving setting.
    // See Section 4.14 (p. 24) in the PFS154 manual.
    // The output is disabled when the LED is off
    // to avoid leakage.

  for (;;) {  
    PAC |=1<<4;  
    // Enable LED output (It's set to High)
    __nop();
    __nop();
    __nop();
    PAC &=~(1<<4); 
    // Disable LED output after 4 cycles => 4 / 53 kHz = 75.5 µs
    __stopexe();
  }

The processor core wakes up on every event generated by Timer2, turns on the LED for 75.5 µs, and then goes back to sleep. The LED is directly connected to an output pin without a series resistor, while the pin is configured to low I/O driving strength to limit the maximum current. This is somewhat risky, but allows operating the LED down to the lowest possible voltage – around 2.1 V for the green LED I am using.

Current Consumption Performance

Well, the code works nicely and flashes the LED at 1.6 Hz at voltages down to slightly above 2V. You can find the complete source here.

To verify that everything works correctly, I set up a simple power model that considers sleep-mode current, active current, and the current used by the LED. The LED current was determined by measuring the on-current of the LED connected to the microcontroller at different supply voltages and multiplying it by the duty cycle. The same approach was taken for the active current of the MCU. You can see the output of the model above and a comparison with measurements. I had to use a series resistor of a few kilohms and a parallel capacitor to ensure that the current ripple was smoothed enough to allow for a steady reading on the multimeters.
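
A minimal sketch of such a power model, with placeholder numbers rather than my measured values:

#include <stdio.h>

/* Average current = sleep current + duty-cycle-weighted contributions of
   the active MCU and the LED. All values below are placeholders. */
int main(void)
{
    double i_sleep_uA  = 0.6;       /* STOPEXE sleep current */
    double i_active_uA = 60.0;      /* MCU active at ~52 kHz */
    double i_led_uA    = 3000.0;    /* LED on-current at low drive strength */

    double t_period_s  = 1.0 / 1.6; /* one flash every ~0.6 s */
    double t_led_s     = 75.5e-6;   /* LED on-time per flash */
    double t_active_s  = 100e-6;    /* assumed MCU awake time per flash */

    double i_avg_uA = i_sleep_uA
                    + i_active_uA * (t_active_s / t_period_s)
                    + i_led_uA    * (t_led_s    / t_period_s);

    printf("average current: %.2f uA\n", i_avg_uA);
    return 0;
}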

As you can see, there is good agreement between model and measurements. Due to the extremely low duty cycle of the LED, the main power consumption still comes from the MCU and the timer. This contribution is highly dependent on the supply voltage, so the most energy-efficient operation is achieved at the lowest voltage.

The total current consumption at 3 V is only about 1 µA! This is less than the self-discharge current of many batteries. A single CR2032 cell with a capacity of around 200 mAh could theoretically power this flasher for 200,000 hours, or 22 years! I was able to operate the circuit (as shown in the title image) for more than 10 minutes from the charge of a 330 µF capacitor charged to 5 V.

Summary

Despite their low cost, the Padauk MCUs can be used for extremely low-power operation. There are certainly ways to improve the flasher circuit further, for example by using an inductive boost converter to allow constant-current operation of the LED at even lower voltages.
