Learning with ChatGPT

By: VM

7 February 2025 at 11:39

I have access to the premium version of ChatGPT, and every day I ask it a few questions about concepts in physics that I’d like to know more about. Yesterday, for example, I learnt the difference between quenching and annealing…

Is there an opposite phenomenon for quenching?

Yes! The opposite of quenching would be processes that allow gradual relaxation, equilibration, or annealing rather than rapid freezing. Some terms that describe this include:

Annealing (Thermal and Simulated) — In metallurgy and physics, annealing is the process of slow cooling a material so it can reach a more ordered, lower-energy state.

In simulated annealing (used in optimization and machine learning), a system is gradually "cooled" to allow it to find a global minimum rather than getting trapped in local minima (which can happen in quenching).

Slow Equilibration / Adiabatic Processes — Unlike quenching, which often leads to metastable states, slow thermalization allows the system to reach true equilibrium.

Adiabatic processes in quantum mechanics and condensed matter physics allow the system to evolve slowly so that it remains in a well-defined state.

Supercooling vs. Slow Nucleation — When liquids are rapidly cooled (quenched), they might bypass crystallization and form a glass. In contrast, if cooling is slow, proper nucleation occurs, leading to a well-ordered crystal.

Self-Organization & Aging in Complex Systems — Some complex systems exhibit aging (e.g., glasses, biological systems) where they slowly evolve toward an ordered or more stable configuration over time, in contrast to being frozen in a quenched state.

… and later about the Weierstrass function.

Please explain the Weierstrass function to me.

The Weierstrass function is a famous example of a continuous but nowhere differentiable function. It is defined as:

[ W(x) = \sum_{n=0}^{\infty} aⁿ \cos(bⁿ \pi x) ]

where:

( 0 < a < 1 ) (controls decay)
( b ) is an odd integer ( > 1 ) (controls frequency)
The product ( ab > 1 + \frac{3\pi}{2} ) ensures nowhere differentiability.

Key Properties

Continuous Everywhere — Despite its wild oscillations, ( W(x) ) is continuous for all ( x ).

Nowhere Differentiable — No well-defined slope exists at any point, making it an early example of a function that is continuous yet "rough" at all scales.

Self-Similarity (Fractal-like Behavior) — Since it is constructed as an infinite sum of cosine waves with exponentially increasing frequencies, it exhibits self-similarity: small-scale structures resemble larger-scale patterns.

Role in Mathematics — This function was introduced by Karl Weierstrass in the 19th century as a counterexample to the assumption that "continuous functions are differentiable almost everywhere."

I have found using ChatGPT in this way to be a valuable learning tool, and I imagine there are already people out there who are repackaging instances of ChatGPT to be autonomous ‘teachers’ for students looking to learn something.

One of my jobs at The Hindu is to get explanatory articles for concepts in (pure) science. I recently attempted one on amplifiers, which required me to explain the working of a NPN bipolar junction transistor, a device I’d until then been happy to leave behind in my high school classroom. I turned to ChatGPT, asking for a visualisation of the transistor, and it obliged. I’m a visual learner and having access to the tool made a big difference.

How does an NPN bipolar junction transistor work? Please explain with a visualisation.

An NPN bipolar junction transistor (BJT) is a type of transistor that consists of three layers of semiconductor material: two n-type layers (emitter and collector) separated by a p-type layer (base). It is commonly used for switching and amplification.

How an NPN Transistor Works

Structure & Terminals:

Emitter (E): Heavily doped n-type region that emits electrons.

Base (B): Thin, lightly doped p-type region.

Collector (C): Moderately doped n-type region that collects electrons.

Operation Principles:

When a small current is applied to the base-emitter junction (forward-biased), it allows a larger current to flow from the emitter to the collector.

The base-emitter junction acts as a diode, allowing electrons to flow into the base.

Most electrons injected from the emitter into the base are swept into the collector due to the voltage difference.

Visualization

      Collector (C)
           |
           |
      |----|----|
      |    N    |   <-- Collector (N-type)
      |----|----|
           |
      |----|----|
      |    P    |   <-- Base (P-type, thin and lightly doped)
      |----|----|
           |
      |----|----|
      |    N    |   <-- Emitter (N-type, heavily doped)
      |----|----|
           |
           |
      Emitter (E)

Current Flow:

Electrons flow from Emitter (E) to Collector (C).

A small Base (B) current controls a much larger Collector (C) current.

Voltage Conditions:

V_BE (Base-Emitter voltage) > 0.7V (Forward biased)

V_CE (Collector-Emitter voltage) > V_BE (Reverse biased)

Electron Flow:

→ Represents electron flow.

I_B (Base Current) is small but controls I_C (Collector Current), which is large.

( +V_C ) | | [C] ← Electrons | | | [B] → Small Base Current (I_B) | | | [E] → Large Electron Flow (I_E) | ( GND )

Working Mechanism:

The Base-Emitter junction is forward-biased, allowing electrons to move from Emitter to Base.

Most electrons are swept into the Collector due to the reverse-biased Collector-Base junction.

A small Base current (I_B) controls a large Collector current (I_C), amplifying the input signal.

I have a background in engineering plus more than a decade’s experience in spotting yellow/red flags in scientific papers, and I imagine anyone with these skills will have an easier time navigating ChatGPT’s answers. For anyone else that lacks these skills, all we need is a guarantee from OpenAI that the tool doesn’t hallucinate or that it hallucinates in specific contexts, and definitely not above a certain rate.

Tim's+Blog
Neural Network Visualization 31 October 2024 at 15:45

Neural Network Visualization

Tim's+Blog

By: cpldcpu

31 October 2024 at 15:45

Going along with implementing a very size optimized neural network on a 3 cent microcontroller I created an interactive simulation of a similar network.

You can draw figures on a 8×8 pixel grid and view how the activations propagate through the multi-layer perception network to classify the image into 4 or 10 different numbers. You can find the visualizer online here.

Amazingly, the accuracy is still quite acceptable, even though this network (2 hidden layers with 10 neurons) is even simpler than the one implemented in the 3 cent MCU (3 hidden layers with 16 neurons). One change that led to significant improvements was to use layernorm, which normalizes mean and stddev, instead of RMSnorm, which only normalizes the stddeviation.

Source for React app and training code can be found here. This program was almost exclusively created by prompting with 3.5-Sonnet in the Claude interface and GH Copilot.

Tim's+Blog
Neural Network Visualization 31 October 2024 at 15:45

Neural Network Visualization

Tim's+Blog

By: cpldcpu

31 October 2024 at 15:45

Going along with implementing a very size optimized neural network on a 3 cent microcontroller I created an interactive simulation of a similar network.

You can draw figures on a 8×8 pixel grid and view how the activations propagate through the multi-layer perception network to classify the image into 4 or 10 different numbers. You can find the visualizer online here.

Amazingly, the accuracy is still quite acceptable, even though this network (2 hidden layers with 10 neurons) is even simpler than the one implemented in the 3 cent MCU (3 hidden layers with 16 neurons). One change that led to significant improvements was to use layernorm, which normalizes mean and stddev, instead of RMSnorm, which only normalizes the stddeviation.

Source for React app and training code can be found here. This program was almost exclusively created by prompting with 3.5-Sonnet in the Claude interface and GH Copilot.

Tim's+Blog
Neural Networks (MNIST inference) on the “3-cent” Microcontroller 2 May 2024 at 23:59

Neural Networks (MNIST inference) on the “3-cent” Microcontroller

Tim's+Blog

By: cpldcpu

2 May 2024 at 23:59

Bouyed by the surprisingly good performance of neural networks with quantization aware training on the CH32V003, I wondered how far this can be pushed. How much can we compress a neural network while still achieving good test accuracy on the MNIST dataset? When it comes to absolutely low-end microcontrollers, there is hardly a more compelling target than the Padauk 8-bit microcontrollers. These are microcontrollers optimized for the simplest and lowest cost applications there are. The smallest device of the portfolio, the PMS150C, sports 1024 13-bit word one-time-programmable memory and 64 bytes of ram, more than an order of magnitude smaller than the CH32V003. In addition, it has a proprieteray accumulator based 8-bit architecture, as opposed to a much more powerful RISC-V instruction set.

Is it possible to implement an MNIST inference engine, which can classify handwritten numbers, also on a PMS150C?

On the CH32V003 I used MNIST samples that were downscaled from 28×28 to 16×16, so that every sample take 256 bytes of storage. This is quite acceptable if there is 16kb of flash available, but with only 1 kword of rom, this is too much. Therefore I started with downscaling the dataset to 8×8 pixels.

The image above shows a few samples from the dataset at both resolutions. At 16×16 it is still easy to discriminate different numbers. At 8×8 it is still possible to guess most numbers, but a lot of information is lost.

Suprisingly, it is still possible to train a machine learning model to recognize even these very low resolution numbers with impressive accuracy. It’s important to remember that the test dataset contains 10000 images that the model does not see during training. The only way for a very small model to recognize these images accurate is to identify common patterns, the model capacity is too limited to “remember” complete digits. I trained a number of different network combinations to understand the trade-off between network memory footprint and achievable accuracy.

Parameter Exploration

The plot above shows the result of my hyperparameter exploration experiments, comparing models with different configurations of weights and quantization levels from 1 to 4 bit for input images of 8×8 and 16×16. The smallest models had to be trained without data augmentation, as they would not converge otherwise.

Again, there is a clear relationship between test accuracy and the memory footprint of the network. Increasing the memory footprint improves accuracy up to a certain point. For 16×16, around 99% accuracy can be achieved at the upper end, while around 98.5% is achieved for 8×8 test samples. This is still quite impressive, considering the significant loss of information for 8×8.

For small models, 8×8 achieves better accuracy than 16×16. The reason for this is that the size of the first layer dominates in small models, and this size is reduced by a factor of 4 for 8×8 inputs.

Surprisingly, it is possible to achieve over 90% test accuracy even on models as small as half a kilobyte. This means that it would fit into the code memory of the microcontroller! Now that the general feasibility has been established, I needed to tweak things further to accommodate the limitations of the MCU.

Training the Target Model

Since the RAM is limited to 64 bytes, the model structure had to use a minimum number of latent parameters during inference. I found that it was possible to use layers as narrow as 16. This reduces the buffer size during inference to only 32 bytes, 16 bytes each for one input buffer and one output buffer, leaving 32 bytes for other variables. The 8×8 input pattern is directly read from the ROM.

Furthermore, I used 2-bit weights with irregular spacing of (-2, -1, 1, 2) to allow for a simplified implementation of the inference code. I also skipped layer normalization and instead used a constant shift to rescale activations. These changes slightly reduced accuracy. The resulting model structure is shown below.

All things considered, I ended up with a model with 90.07% accuracy and a total of 3392 bits (0.414 kilobytes) in 1696 weights, as shown in the log below. The panel on the right displays the first layer weights of the trained model, which directly mask features in the test images. In contrast to the higher accuracy models, each channel seems to combine many features at once, and no discernible patterns can be seen.

Implementation on the Microntroller

In the first iteration, I used a slightly larger variant of the Padauk Microcontrollers, the PFS154. This device has twice the ROM and RAM and can be reflashed, which tremendously simplifies software development. The C versions of the inference code, including the debug output, worked almost out of the box. Below, you can see the predictions and labels, including the last layer output.

Squeezing everything down to fit into the smaller PMS150C was a different matter. One major issue when programming these devices in C is that every function call consumes RAM for the return stack and function parameters. This is unavoidable because the architecture has only a single register (the accumulator), so all other operations must occur in RAM.

To solve this, I flattened the inference code and implemented the inner loop in assembly to optimize variable usage. The inner loop for memory-to-memory inference of one layer is shown below. The two-bit weight is multiplied with a four-bit activation in the accumulator and then added to a 16-bit register. The multiplication requires only four instructions (t0sn, sl,t0sn,neg), thanks to the powerful bit manipulation instructions of the architecture. The sign-extending addition (add, addc, sl, subc) also consists of four instructions, demonstrating the limitations of 8-bit architectures.

void fc_innerloop_mem(uint8_t loops) {

    sum = 0;
    do  {
       weightChunk = *weightidx++;
__asm   
    idxm  a, _activations_idx
	inc	_activations_idx+0

    t0sn _weightChunk, #6
    sl     a            ;    if (weightChunk & 0x40) in = in+in;
    t0sn _weightChunk, #7
    neg    a           ;     if (weightChunk & 0x80) in =-in;                    

    add    _sum+0,a
    addc   _sum+1
    sl     a 
    subc   _sum+1  

  ... 3x more ...

__endasm;
    } while (--loops);

    int8_t sum8 = ((uint16_t)sum)>>3; // Normalization
    sum8 = sum8 < 0 ? 0 : sum8; // ReLU
    *output++ = sum8;
}

In the end, I managed to fit the entire inference code into 1 kilowords of memory and reduced sram usage to 59 bytes, as seen below. (Note that the output from SDCC is assuming 2 bytes per instruction word, while it is only 13 bits).

Success! Unfortunately, there was no rom space left for the soft UART to output debug information. However, based on the verificaiton on PFS154, I trust that the code works, and since I don’t have any specific application in mind, I left it at that stage.

Summary

It is indeed possible to implement MNIST inference with good accuracy using one of the cheapest and simplest microcontrollers on the market. A lot of memory footprint and processing overhead is usually spent on implementing flexible inference engines, that can accomodate a wide range of operators and model structures. Cutting this overhead away and reducing the functionality to its core allows for astonishing simplification at this very low end.

This hack demonstrates that there truly is no fundamental lower limit to applying machine learning and edge inference. However, the feasibility of implementing useful applications at this level is somewhat doubtful.

You can find the project repository here.

Pixel Envy
Apple Intelligence 11 June 2024 at 14:00

Apple Intelligence

Pixel Envy

By: Nick Heer

11 June 2024 at 14:00

Daniel Jalkut, last ~~month~~ year:

Which leads me to my somewhat far-fetched prediction for WWDC: Apple will talk about AI, but they won’t once utter the letters “AI”. They will allude to a major new initiative, under way for years within the company. The benefits of this project will make it obvious that it is meant to serve as an answer to comparable efforts being made by OpenAI, Microsoft, Google, and Facebook. During the crescendo to announcing its name, the letters “A” and “I” will be on all of our lips, and then they’ll drop the proverbial mic: “We’re calling it Apple Intelligence.” Get it?

Apple:

Apple today introduced Apple Intelligence, the personal intelligence system for iPhone, iPad, and Mac that combines the power of generative models with personal context to deliver intelligence that’s incredibly useful and relevant. Apple Intelligence is deeply integrated into iOS 18, iPadOS 18, and macOS Sequoia. It harnesses the power of Apple silicon to understand and create language and images, take action across apps, and draw from personal context to simplify and accelerate everyday tasks. With Private Cloud Compute, Apple sets a new standard for privacy in AI, with the ability to flex and scale computational capacity between on-device processing and larger, server-based models that run on dedicated Apple silicon servers.

To Apple’s credit, the letters “A.I.” were only enunciated a handful of times during its main presentation today, far less often than I had expected. Mind you, in sixty-odd places, “A.I.” was instead referred to by the branded “Apple Intelligence” moniker which is also “A.I.” in its own way. I want half-right points.

There are several concerns with features like these, and Apple answered two of them today: how it was trained, and the privacy and security of user data. The former was not explained during today’s presentation, nor in its marketing materials and developer documentation. But it was revealed by John Giannandrea, senior vice president of Machine Learning and A.I. Strategy, in an afternoon question-and-answer session hosted by Justine Ezarik, as live-blogged by Nilay Patel at the Verge:¹

What have these models actually been trained on? Giannandrea says “we start with the investment we have in web search” and start with data from the public web. Publishers can opt out of that. They also license a wide amount of data, including news archives, books, and so on. For diffusion models (images) “a large amount of data was actually created by Apple.”

If publishers wish to opt out of Apple’s training models but continue to permit crawling for things like Siri and Spotlight, they should add a disallow rule for Applebot-Extended. Because of Apple’s penchant for secrecy, that usage control was not added until today. That means a site may have been absorbed into training data unless its owners opted out of all Applebot crawling. Hard to decline participating in something you do not even know about.

Additionally, in April, Katie Paul and Anna Tong reported for Reuters that Apple struck a licensing agreement with Shutterstock for image training purposes.

Apple is also, unsurprisingly, promoting heavily the privacy and security policies it has in place. It noted some of these attributes in its presentation — including some auditable code and data minimization — and elaborated on Private Cloud Compute on its security blog:

With services that are end-to-end encrypted, such as iMessage, the service operator cannot access the data that transits through the system. One of the key reasons such designs can assure privacy is specifically because they prevent the service from performing computations on user data. Since Private Cloud Compute needs to be able to access the data in the user’s request to allow a large foundation model to fulfill it, complete end-to-end encryption is not an option. Instead, the PCC compute node must have technical enforcement for the privacy of user data during processing, and must be incapable of retaining user data after its duty cycle is complete.

[…]

User data is never available to Apple — even to staff with administrative access to the production service or hardware.

Apple can make all the promises it wants, and it appears it does truly want to use generative A.I. in a more responsible way. For example, the images you can make using Image Playground cannot be photorealistic and — at least for those shown so far — are so strange you may avoid using them. Similarly, though I am not entirely sure, it seems plausible the query system is designed to be more private and secure than today’s Siri.

Yet, as I wrote last week, users may not trust any of these promises. Many of these fears are logical: people are concerned about the environment, creative practices, and how their private information is used. But some are more about the feel of it — and that is okay. Even if all the training data were fully licensed and user data is as private and secure as Apple says, there is still an understandable ick factor for some people. The way companies like Apple, Google, and OpenAI have trained their A.I. models on the sum of human creativity represents a huge imbalance of power, and the only way to control Apple’s public data use was revealed yesterday. Many of the controls Apple has in place are policies which can be changed.

Consider how, so far as I can see, there will be no way to know for certain if your Siri query is being processed locally or by Apple’s servers. You do not know that today when using Siri, though you can infer it based on what you are doing and if something does not work when Apple’s Siri service is down. It seems likely that will be the case with this new version, too.

Then there are questions about the ethos of generative intelligence. Apple has long positioned its products as tools which enable people to express themselves creatively. Generative models have been pitched as almost the opposite: now, you do not have to pay for someone’s artistic expertise. You can just tell a computer to write something and it will do so. It may be shallow and unexciting, but at least it was free and near-instantaneous. Apple notably introduced its set of generative services only a month after it embarrassed itself by crushing analogue tools into an iPad. Happily, it seems this first set of generative features is more laundry and less art — making notifications less intrusive, categorizing emails, making Siri not-shit. I hope I can turn off things like automatic email replies.

You will note my speculative tone. That is because Apple’s generative features have not been made available yet, including in developer beta builds of its new operating system. None of us have any idea how useful these features are, nor what limitations they have. All we can see are Apple’s demonstrations and the metrics it has shared. So, we will see how any of this actually pans out. I have been bamboozled by this same corporation making similar promises before.

“May you live in interesting times”, indeed.

The Verge’s live blog does not have per-update permalinks so you will need to load all the messages and find this for yourself. ↥︎

⌥ Permalink

Tim's+Blog
Neural Networks (MNIST inference) on the “3-cent” Microcontroller 2 May 2024 at 23:59

Neural Networks (MNIST inference) on the “3-cent” Microcontroller

Tim's+Blog

By: cpldcpu

2 May 2024 at 23:59

Bouyed by the surprisingly good performance of neural networks with quantization aware training on the CH32V003, I wondered how far this can be pushed. How much can we compress a neural network while still achieving good test accuracy on the MNIST dataset? When it comes to absolutely low-end microcontrollers, there is hardly a more compelling target than the Padauk 8-bit microcontrollers. These are microcontrollers optimized for the simplest and lowest cost applications there are. The smallest device of the portfolio, the PMS150C, sports 1024 13-bit word one-time-programmable memory and 64 bytes of ram, more than an order of magnitude smaller than the CH32V003. In addition, it has a proprieteray accumulator based 8-bit architecture, as opposed to a much more powerful RISC-V instruction set.

Is it possible to implement an MNIST inference engine, which can classify handwritten numbers, also on a PMS150C?

On the CH32V003 I used MNIST samples that were downscaled from 28×28 to 16×16, so that every sample take 256 bytes of storage. This is quite acceptable if there is 16kb of flash available, but with only 1 kword of rom, this is too much. Therefore I started with downscaling the dataset to 8×8 pixels.

The image above shows a few samples from the dataset at both resolutions. At 16×16 it is still easy to discriminate different numbers. At 8×8 it is still possible to guess most numbers, but a lot of information is lost.

Suprisingly, it is still possible to train a machine learning model to recognize even these very low resolution numbers with impressive accuracy. It’s important to remember that the test dataset contains 10000 images that the model does not see during training. The only way for a very small model to recognize these images accurate is to identify common patterns, the model capacity is too limited to “remember” complete digits. I trained a number of different network combinations to understand the trade-off between network memory footprint and achievable accuracy.

Parameter Exploration

The plot above shows the result of my hyperparameter exploration experiments, comparing models with different configurations of weights and quantization levels from 1 to 4 bit for input images of 8×8 and 16×16. The smallest models had to be trained without data augmentation, as they would not converge otherwise.

Again, there is a clear relationship between test accuracy and the memory footprint of the network. Increasing the memory footprint improves accuracy up to a certain point. For 16×16, around 99% accuracy can be achieved at the upper end, while around 98.5% is achieved for 8×8 test samples. This is still quite impressive, considering the significant loss of information for 8×8.

For small models, 8×8 achieves better accuracy than 16×16. The reason for this is that the size of the first layer dominates in small models, and this size is reduced by a factor of 4 for 8×8 inputs.

Surprisingly, it is possible to achieve over 90% test accuracy even on models as small as half a kilobyte. This means that it would fit into the code memory of the microcontroller! Now that the general feasibility has been established, I needed to tweak things further to accommodate the limitations of the MCU.

Training the Target Model

Since the RAM is limited to 64 bytes, the model structure had to use a minimum number of latent parameters during inference. I found that it was possible to use layers as narrow as 16. This reduces the buffer size during inference to only 32 bytes, 16 bytes each for one input buffer and one output buffer, leaving 32 bytes for other variables. The 8×8 input pattern is directly read from the ROM.

Furthermore, I used 2-bit weights with irregular spacing of (-2, -1, 1, 2) to allow for a simplified implementation of the inference code. I also skipped layer normalization and instead used a constant shift to rescale activations. These changes slightly reduced accuracy. The resulting model structure is shown below.

All things considered, I ended up with a model with 90.07% accuracy and a total of 3392 bits (0.414 kilobytes) in 1696 weights, as shown in the log below. The panel on the right displays the first layer weights of the trained model, which directly mask features in the test images. In contrast to the higher accuracy models, each channel seems to combine many features at once, and no discernible patterns can be seen.

Implementation on the Microntroller

In the first iteration, I used a slightly larger variant of the Padauk Microcontrollers, the PFS154. This device has twice the ROM and RAM and can be reflashed, which tremendously simplifies software development. The C versions of the inference code, including the debug output, worked almost out of the box. Below, you can see the predictions and labels, including the last layer output.

Squeezing everything down to fit into the smaller PMS150C was a different matter. One major issue when programming these devices in C is that every function call consumes RAM for the return stack and function parameters. This is unavoidable because the architecture has only a single register (the accumulator), so all other operations must occur in RAM.

To solve this, I flattened the inference code and implemented the inner loop in assembly to optimize variable usage. The inner loop for memory-to-memory inference of one layer is shown below. The two-bit weight is multiplied with a four-bit activation in the accumulator and then added to a 16-bit register. The multiplication requires only four instructions (t0sn, sl,t0sn,neg), thanks to the powerful bit manipulation instructions of the architecture. The sign-extending addition (add, addc, sl, subc) also consists of four instructions, demonstrating the limitations of 8-bit architectures.

void fc_innerloop_mem(uint8_t loops) {

    sum = 0;
    do  {
       weightChunk = *weightidx++;
__asm   
    idxm  a, _activations_idx
	inc	_activations_idx+0

    t0sn _weightChunk, #6
    sl     a            ;    if (weightChunk & 0x40) in = in+in;
    t0sn _weightChunk, #7
    neg    a           ;     if (weightChunk & 0x80) in =-in;                    

    add    _sum+0,a
    addc   _sum+1
    sl     a 
    subc   _sum+1  

  ... 3x more ...

__endasm;
    } while (--loops);

    int8_t sum8 = ((uint16_t)sum)>>3; // Normalization
    sum8 = sum8 < 0 ? 0 : sum8; // ReLU
    *output++ = sum8;
}

In the end, I managed to fit the entire inference code into 1 kilowords of memory and reduced sram usage to 59 bytes, as seen below. (Note that the output from SDCC is assuming 2 bytes per instruction word, while it is only 13 bits).

Success! Unfortunately, there was no rom space left for the soft UART to output debug information. However, based on the verificaiton on PFS154, I trust that the code works, and since I don’t have any specific application in mind, I left it at that stage.

Summary

It is indeed possible to implement MNIST inference with good accuracy using one of the cheapest and simplest microcontrollers on the market. A lot of memory footprint and processing overhead is usually spent on implementing flexible inference engines, that can accomodate a wide range of operators and model structures. Cutting this overhead away and reducing the functionality to its core allows for astonishing simplification at this very low end.

This hack demonstrates that there truly is no fundamental lower limit to applying machine learning and edge inference. However, the feasibility of implementing useful applications at this level is somewhat doubtful.

You can find the project repository here.

Tim's+Blog
Implementing Neural Networks on the “10-cent” RISC-V MCU without Multiplier 24 April 2024 at 10:20

Implementing Neural Networks on the “10-cent” RISC-V MCU without Multiplier

Tim's+Blog

By: cpldcpu

24 April 2024 at 10:20

I have been meaning for a while to establish a setup to implement neural network based algorithms on smaller microcontrollers. After reviewing existing solutions, I felt there is no solution that I really felt comfortable with. One obvious issue is that often flexibility is traded for overhead. As always, for a really optimized solution you have to roll your own. So I did. You can find the project here and a detailed writeup here.

It is always easier to work with a clear challenge: I picked the CH32V003 as my target platform. This is the smallest RISC-V microcontroller on the market right now, addressing a $0.10 price point. It sports 2kb of SRAM and 16kb of flash. It is somewhat unique in implementing the RV32EC instruction set architecture, which does not even support multiplications. In other words, for many purposes this controller is less capable than an Arduino UNO.

As a test subject I chose the well-known MNIST dataset, which consists of images of hand written numbers which need to be classified from 0 to 9. Many inspiring implementation on Arduino exist for MNIST, for example here. In this case, the inference time was 7 seconds and 82% accuracy was achieved.

The idea is to train a neural network on a PC and optimize it for inference on teh CH32V003 while meetings these criteria:

Be as fast and as accurate as possible
Low SRAM footprint during inference to fit into 2kb sram
Keep the weights of the neural network as small as possible
No multiplications!

These criteria can be addressed by using a neural network with quantized weights, were each weight is represented with as few bits as possible. The best possible results are achieved when training the network already on quantized weights (Quantization Aware Training) as opposed to quantized a model that was trained with high accuracy weights. There is currently some hype around using Binary and Ternary weights for large language models. But indeed, we can also use these approaches to fit a neural network to a small microcontroller.

The benefit of only using a few bits to represent each weight is that the memory footprint is low and we do not need a real multiplication instruction – inference can be reduced to additions only.

Model structure and optimization

For simplicity reasons, I decided to go for a e network architecture based on fully-connected layers instead of convolutional neural networks. The input images are reduced to a size of 16×16=256 pixels and are then fed into the network as shown below.

The implementation of the inference engine is straightforward since only fully connected layers are used. The code snippet below shows the innerloop, which implements multiplication of 4 bit weights by using adds and shifts. The weights use a one-complement encoding without zero, which helps with code efficiency. One bit, ternary, and 2 bit quantization was implemented in a similar way.

    int32_t sum = 0;
    for (uint32_t k = 0; k < n_input; k+=8) {
        uint32_t weightChunk = *weightidx++;

        for (uint32_t j = 0; j < 8; j++) {
            int32_t in=*activations_idx++;
            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            sum += tmpsum;                                  // sign*in*1
            if (weightChunk & 0x40000000) sum += tmpsum<<3; // sign*in*8
            if (weightChunk & 0x20000000) sum += tmpsum<<2; // sign*in*4
            if (weightChunk & 0x10000000) sum += tmpsum<<1; // sign*in*2
            weightChunk <<= 4;
        }
    }
    output[i] = sum;

In addition the fc layers also normalization and ReLU operators are required. I found that it was possible to replace a more complex RMS normalization with simple shifts in the inference. Not a single full 32×32 multiplication is needed for the inference! Having this simple structure for inference means that we have to focus the effort on the training part.

I studied variations of the network with different numbers of bits and different sizes by varying the numer of hidden activiations. To my surprise I found that the accuracy of the prediction is proportional to the total number of bits used to store the weights. For example, when 2 bits are used for each weight, twice the numbers of weights are needed to achieve the same perforemnce as a 4 bit weight network. The plot below shows training loss vs. total number of bits. We can see that for 1-4 bits, we can basically trade more weights for less bits. This trade-off is less efficient for 8 bits and no quantization (fp32).

I further optimized the training by using data augmentation, a cosine schedule and more epochs. It seems that 4 bit weights offered the best trade off.

More than 99% accuracy was achieved for 12 kbyte model size. While it is possible to achiever better accuracy with much larger models, it is significantly more accurate than other on-MCU implementations of MNIST.

Implementation on the Microcontroller

The model data is exported to a c-header file for inclusion into the inference code. I used the excellent ch32v003fun environment, which allowed me to reduce overhead to be able to store 12kb of weights plus the inference engine in only 16kb of flash.

There was still enough free flash to include 4 sample images. The inference output is shown above. Execution time for one inference is 13.7 ms which would actually allow to model to process moving image input in real time.

Alternatively, I also tested a smaller model with 4512 2-bit parameters and only 1kb of flash memory footprintg. Despite its size, it still achieves a 94.22% test accuracy and it executes in only 1.88ms.

Conclusions

This was quite a tedious projects, hunting many lost bits and rounding errors. I am quite pleased with the outcome as it shows that it is possible to compress neural networks very significantly with dedicated effort. I learned a lot and am planning to use the data pipeline for more interesting applications.

Normal view

Parameter Exploration

Training the Target Model

Implementation on the Microntroller

Summary

Parameter Exploration

Training the Target Model

Implementation on the Microntroller

Summary

Model structure and optimization

Implementation on the Microcontroller

Conclusions