Introduction
When it comes to embedded software, there are at least two distinct ways to use AI:
- We can use AI while we develop code. This may be ChatGPT in a window "on the side", or a more integrated scenario - e.g., Copilot inside VS Code acting as "intellisense++". AI can also be used on the build server to analyze code in what we could call "automated reviews" and to support other processes.
- We can include AI in the product we develop.
Even though I believe there is plenty of low-hanging fruit in the first bullet above, the goal of this page is to dig into the second bullet - AI in the actual embedded system. Until recently, this has mostly been mimicked by letting the embedded system collect input data, send it to a cloud server that runs it through AI and returns an answer. This is known from e.g., Amazon's original Echo/Alexa and similar products.
In my book Embedded Software for the IoT I describe SPC - Statistical Process Control - in e.g., factories, as something that used to be driven by central computers but can now run locally in smaller "edge" systems in production. This is a solution that scales and is very resilient against single-point-of-failure scenarios. Still, in the SPC scenario, the embedded "edge" system would be running classic SPC algorithms - written by human developers.
With AI we can harvest similar benefits - and more. AI can be used as the "next level", where a small, specifically trained model runs in the embedded system and takes local decisions. There are also hybrid solutions. In the SPC case, the edge system would report statistics on e.g., an hourly basis (when possible) to a central SPC system. The AI successor might do the same - creating overall statistics, graphs and images in the cloud - integrating data from a complete factory.

So far I have used the acronym AI for Artificial Intelligence. There are many types of AI - like expert systems, robots etc. - but we will mainly be dealing with the subset known as Machine Learning - ML. When you don't need to write any actual code, but can feed the machine with data to make it work, it is Machine Learning. When Machine Learning is realised using neural networks in many layers, it is called "Deep Learning". This is our main subject here, but like many others I tend to use the umbrella term "AI".
The work on "Artificial Intelligence" started in the 1950'ies. I remember in the late 1980'ies how it was revived in the form of neural networks and kind of died again, because it had little practical use (almost none in the embedded world) as it was then. This time it is clear that it can be used in a lot of fields - although it may be a bit hyped.
So what has changed to make it so useful now? Basically two things: software and hardware.
Based on the literature, it seems that the current AI wave took off in 2017 with an article called "Attention is all you need" (see links) - probably a nod to the Beatles' "All You Need Is Love". This is the software part of the change - or rather, it's the single step that really stands out when looking in the rearview mirror.
Learning from ChatGPT
I am sure that we all know ChatGPT. The "GPT" part means "Generative Pre-trained Transformer" - and in ChatGPT, this "transformer" can participate in a dialog with the user. This is a huge step forward compared to the classic Google search, where each search phrase (or prompt) is a fresh start, and where answers can only be links to existing pages. Let's look at the ChatGPT terms one-by-one:
Generative
This means that it can generate new content from whatever is input - normally a "prompt", but it may also be data from sensors in an embedded system. Since AI works on numbers, we actually skip some work by having numbers coming in.
Pre-trained
This means that the original - very generic - model has been trained with a lot of input before we start using it. The training adjusts weights in the matrices used by the math engine. In the case of ChatGPT 4 and upwards, the training material consists of more than half the content found on the internet.

As we know, an embedded system is basically a computer with only one task (it can be a very complex task - like driving a car). Thus an embedded AI system might be trained with an insanely small fraction of what goes into ChatGPT - targeted at the narrow purpose of the embedded system. This means that while generic ChatGPT and friends/opponents need vast resources in terms of memory and CPU power, there is still sanity in trying to fit an ML system into a resource-restricted embedded system.
Transformer
This is a name for the specific "deep learning" architecture from the 2017 article. I think the name was partly chosen because the model can transform a printed text into a translation in another language - and partly because the team behind it was watching "Transformers" movies. It is used in most Large Language Models - LLMs - and is very much related to the neural networks from the previous millennium. Later I will try to summarize what you can also read in the links.

The main workhorse in neural networks is matrix multiplication, which is realized as per-cell Multiply-Accumulates - aka MACs. This is not just "how it turned out". Since many earlier concepts - and existing parts of the new machines - already use matrices, we have optimized heavily for these, and it makes sense to keep using matrices when possible. The fact that long sequences of matrix MACs can be chopped up into parallel operations is also important.
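To make the MAC point concrete, here is a minimal C sketch of my own (not taken from any particular library) of a matrix-vector multiplication - the operation a neural-network layer essentially boils down to. The inner loop is nothing but repeated multiply-accumulates, which is exactly what GPUs and DSP extensions parallelize:

```c
#include <stdio.h>

/* Toy example: y = W * x, where W is ROWS x COLS.
   The inner loop is a plain Multiply-Accumulate (MAC) chain -
   the operation that GPUs and DSP extensions parallelize. */
#define ROWS 3
#define COLS 4

static void mat_vec_mul(const float W[ROWS][COLS], const float x[COLS], float y[ROWS])
{
    for (int r = 0; r < ROWS; r++) {
        float acc = 0.0f;                 /* the accumulator */
        for (int c = 0; c < COLS; c++) {
            acc += W[r][c] * x[c];        /* one MAC per iteration */
        }
        y[r] = acc;
    }
}

int main(void)
{
    const float W[ROWS][COLS] = {
        {1, 2, 3, 4},
        {0, 1, 0, 1},
        {2, 0, 2, 0}
    };
    const float x[COLS] = {1, 1, 1, 1};
    float y[ROWS];

    mat_vec_mul(W, x, y);
    for (int r = 0; r < ROWS; r++)
        printf("y[%d] = %.1f\n", r, y[r]);
    return 0;
}
```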
This is why GPUs - Graphics Processing Units - predominantly from Nvidia - are used in AI, and thus became the hardware part of the breakthrough mentioned earlier. GPUs were originally created to offload the CPU from the heavy matrix multiplications needed on graphics boards - e.g., moving and turning 3D objects on screen. As AI became interesting, it was relatively easy to adapt the GPU design so that new generations could do the heavy lifting in the "transformer", and Nvidia got a head start.
In my book Microcontrollers with C, I describe CPUs and MCUs with built-in DSP functionality. There used to be a growth path: the smallest DSP-like engine would support the CPU's inherent integer width (16, 32 or 64 bits) with a limited instruction set - centered around MAC - Multiply-Accumulate, which is also the heart of filters and FFTs. The next step was 32-bit floating point ("float" in C), and then bigger CPUs/MCUs would support 64-bit floating point ("double" in C) - and often more instructions as well.
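As a small illustration of why MAC is also "the heart of filters": below is a toy FIR filter in C - my own sketch with arbitrary coefficients (a simple moving average), not production DSP code. It is the same multiply-accumulate loop again, just running over a stream of samples instead of a matrix:

```c
#include <stdio.h>

/* Toy FIR filter: y[n] = sum over k of h[k] * x[n-k].
   The coefficients below form a simple moving average - chosen only
   for illustration. Note that the core is, once again, a MAC loop. */
#define TAPS 4

static float fir(const float h[TAPS], const float x[], int n)
{
    float acc = 0.0f;
    for (int k = 0; k < TAPS; k++) {
        if (n - k >= 0)
            acc += h[k] * x[n - k];   /* Multiply-Accumulate */
    }
    return acc;
}

int main(void)
{
    const float h[TAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };
    const float x[8]    = { 1, 2, 3, 4, 4, 3, 2, 1 };

    for (int n = 0; n < 8; n++)
        printf("y[%d] = %.2f\n", n, fir(h, x, n));
    return 0;
}
```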
Having watched this unfold for decades in my world of sound processing, I was at first surprised when I saw that many of the newer chips now came with lower precisions - down to 8-bit integers and "half-precision" floating point - 16 bits. In most classic DSP scenarios, the DSP is working on a fast stream of data, doing filters and FFTs. A 1024-point FFT with 32-bit data - integers or floats - is often quite sufficient. This occupies 4 kBytes of data in and out. Depending on the choice of algorithm and buffering, you may need e.g. 10 kB per sound channel and quite a bit more for image processing. However, a deep-learning model will consume many times this amount of data. On the other hand, the output rate is much lower than what I am used to in the world of sound and vibration. And - as we will see - the dynamic-range requirements inside such a system are very limited.
Thus the embedded-chip designers changed their DSPs to (also) support lower-precision operations in order to reduce memory needs. Since the numerous matrix-multiplication operations can be sped up by doing MACs (Multiply-Accumulates) in parallel, the AI-assisting DSPs are able to work on four or eight 8-bit values in parallel. This allows the rest of the chip to keep working with its "normal" 32-bit or 64-bit word width. Implementing algorithms in a way that utilizes these parallel matrix calculations is absolutely not simple. That's why there are libraries like CMSIS-NN for ARM.
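The core idea looks something like the sketch below: a quantized 8-bit dot product with a 32-bit accumulator. This is my own plain-C illustration of the principle - not actual CMSIS-NN code - and a real library would use SIMD instructions to perform several of these 8-bit MACs per cycle:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy 8-bit quantized dot product - NOT real CMSIS-NN code, just the idea.
   Weights and activations are int8_t; the accumulator is int32_t so it
   does not overflow. SIMD-capable DSP extensions let the hardware do
   several of these 8-bit MACs in parallel per cycle. */
static int32_t dot_q7(const int8_t *weights, const int8_t *activations, int len)
{
    int32_t acc = 0;
    for (int i = 0; i < len; i++) {
        acc += (int32_t)weights[i] * (int32_t)activations[i];   /* 8-bit MAC, 32-bit accumulate */
    }
    return acc;
}

int main(void)
{
    const int8_t w[4] = {  10, -20,  30, -40 };
    const int8_t a[4] = { 100, 100, 100, 100 };

    printf("acc = %ld\n", (long)dot_q7(w, a, 4));   /* prints -2000 */
    return 0;
}
```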
Chat
This refers to the dialog that the AI system can have with a user. In an embedded system you probably don't see a dialog, and the word "chat" is not used. The input will instead be a mix of live data from sensors, historical data from the same sensors, and configuration data.
Training
The generic ML weights are first set to random values. Then the model is fed with data. The data may be labelled - like when a picture of a cat is given with the label "cat". The data can also be "self-referring" - you simply feed unlabelled data into the model.
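To give a feel for what "adjusting weights from labelled data" means, here is a deliberately tiny C example of my own: a single perceptron learning the AND function with the classic perceptron learning rule. Real deep learning uses backpropagation across many layers, but the principle - feed labelled data, nudge the weights when the output is wrong - is the same:

```c
#include <stdio.h>

/* Toy training example: one perceptron learns the AND function from
   labelled data using the perceptron learning rule. This is only an
   illustration of the principle - real deep learning adjusts millions
   of weights with backpropagation. */
int main(void)
{
    /* Labelled data: two inputs and the expected output (the "label"). */
    const float in[4][2] = { {0,0}, {0,1}, {1,0}, {1,1} };
    const float label[4] = {   0,     0,     0,     1   };

    float w[2] = { 0.0f, 0.0f };   /* weights - initially "untrained" */
    float bias = 0.0f;
    const float lr = 0.1f;         /* learning rate */

    for (int epoch = 0; epoch < 50; epoch++) {
        for (int i = 0; i < 4; i++) {
            float sum = w[0]*in[i][0] + w[1]*in[i][1] + bias;
            float out = (sum > 0.0f) ? 1.0f : 0.0f;   /* step activation  */
            float err = label[i] - out;               /* compare to label */
            w[0] += lr * err * in[i][0];              /* adjust weights   */
            w[1] += lr * err * in[i][1];
            bias += lr * err;
        }
    }

    /* Inference with the trained weights */
    for (int i = 0; i < 4; i++) {
        float sum = w[0]*in[i][0] + w[1]*in[i][1] + bias;
        printf("%g AND %g -> %d\n", in[i][0], in[i][1], (sum > 0.0f) ? 1 : 0);
    }
    return 0;
}
```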
Inference
The AI is used for decision making, pattern recognition, generation of new data etc.
Privacy
With the AI only running in the field - not on a central server - we may get rid of privacy issues that we have seen before. If data is not sent to a central server, it cannot be leaked from it, while the embedded system will need to "act and forget" with its limited resources.

I recently read about a guy who had "debugged" his Roomba vacuum cleaner and realized that it was sending a 3D layout of his house to a central server. Not a big problem, many would say - but what if the same happens in a bank or a defense installation?
Obviously, edge-based face recognition in the public space is not exactly protecting privacy, so there are also counter-arguments.
Efficiency
As stated, running the AI at the edge saves the system from transmitting data over the internet. This leads to lower communication costs - and it might even lead to lower power usage.
Latency
Without the need to transfer data back and forth over the internet, a lot of time is saved. Propagation delays are bounded by physics (see Ethernet Tx Delay) and not easily shrunk. With a fast embedded system we will get faster answers. Obviously the challenge is to speed up the embedded system despite its lack of resources - including power - and this is one of the main tasks. Low latency is a must in self-driving cars and robots, where we have real-time requirements.
Robustness
When not depending on an internet connection, the system is more resilient. In medical devices, robots, cars etc., a briefly congested network connection might otherwise be a disaster. There is also the Denial-of-Service angle: if many embedded systems depend on a single cloud server, bringing down this server would affect all of them.
Customization
With AI in the edge device, this device may be able to learn from its usage. In many scenarios, understanding a single user's - or a small group of users' - language, gestures, movements and use cases is much more important than being able to serve many different users. Surely this information might also be stored on a server, but such a solution is not really scalable.
Types of AI
The following table lists the various types of AI:
| Type | What it does | Examples |
|---|---|---|
| Generative AI | Creates new content | Text, images, music etc |
| Discriminative AI | Classifies or identifies data | Spam detection, face recognition |
| Reinforcement AI | Learning via feedback or rewards | Robotics, games |
| Predictive AI | Forecasting | Weather, sales |
| Analytical AI | Finding patterns or clusters | Market segmentation |
The first systems I noticed were the "classifiers" - doing discriminative AI. You also see these when you dig into the many videos, articles and books about AI. You may present a system with an image of a cat, and the system says "cat". A common example is presenting the system with a handwritten digit (a number between 0 and 9), and the system detects the right value. As humans we are really good at this, but as a programmer you will probably agree that writing a C program that "looks" at a number of pixels and deduces which digit a human has (hastily) written is not a simple task.
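With ML, the hand-written pixel rules are replaced by learned weights. The sketch below is my own toy illustration of the shape of such a classifier: one dense layer mapping pixels to ten class scores, followed by an argmax. The weights are placeholders - in a real system they come from training on thousands of labelled images, and there would be more layers:

```c
#include <stdio.h>

/* Sketch of how an ML classifier "looks at" pixels: a single dense layer
   mapping PIXELS inputs to 10 class scores, followed by an argmax.
   All numbers here are placeholders - real weights come from training. */
#define PIXELS  16    /* a real MNIST image has 28*28 = 784 pixels */
#define CLASSES 10

static int classify(const float img[PIXELS],
                    const float W[CLASSES][PIXELS],
                    const float b[CLASSES])
{
    int best = 0;
    float best_score = -1e30f;

    for (int c = 0; c < CLASSES; c++) {
        float score = b[c];
        for (int p = 0; p < PIXELS; p++)
            score += W[c][p] * img[p];                 /* MACs again */
        if (score > best_score) { best_score = score; best = c; }
    }
    return best;   /* index of the most likely digit */
}

int main(void)
{
    /* Placeholder parameters - the bias for class 7 is nudged up so that
       this demo predicts "7". In reality W and b come from training. */
    static const float W[CLASSES][PIXELS] = { { 0 } };
    static const float b[CLASSES] = { 0, 0, 0, 0, 0, 0, 0, 1.0f, 0, 0 };
    const float img[PIXELS] = { 0 };   /* a dummy "image" */

    printf("Predicted digit: %d\n", classify(img, W, b));
    return 0;
}
```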
We also see a steady stream of videos of robots doing more and more impressive jobs - especially robots that are shaped like humans: walking, running, jumping - even playing soccer (sort of). Lately all the hype has been about generative AI. However, I think that in embedded systems generative AI may be less dominating - simply because tasks like robotics and classification of e.g., fingerprints, faces, people or animals are often handled at the "edge".
Layman's explanation of how Generative AI works
In one of the videos I watched on AI, someone explained generative AI like this: what e.g. ChatGPT does is that it always looks at the input tokens (think of words) received until now - and then calculates probabilities for the next word in the sentence. It outputs the most likely word - and the process starts over, now with this word added. So when you see ChatGPT writing its answer word-by-word, it's not an attempt to mimic an old typewriter or similar. No - each word is the result of a new calculation.
So - how does it get started? During training, the neural network started out with random weights in its matrices, but by the time you use it, those weights are fixed by the pre-training. Your prompt is fed in, and it starts calculating - what would be the most likely first word in response to that sequence of words?
In order not to simply repeat an existing article from the internet, it does not always output the word with the highest probability. Some randomness is introduced - allowing the machine to sometimes pick number 2 or 5 etc. on the top-N list of likely responses.
Because each preceding word is used in the calculation of the next, the machine has a "sense" of context. It will treat the word "bank" one way in a prompt about interest rates and another way in a prompt about rivers or nature.
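The loop itself is simple - all the magic sits in the probability calculation. Below is a toy C sketch of my own where the "model" is just a hard-coded table of which word follows which; a stand-in for the huge transformer calculation, but with the same predict-append-repeat structure:

```c
#include <stdio.h>

/* Toy illustration of the generation loop. The "model" is a hard-coded
   table stating which word most likely follows which - a stand-in for a
   real transformer. The point is the loop: predict the next word, append
   it, and run the prediction again on the longer sequence. */
#define VOCAB 5

static const char *words[VOCAB] = { "<end>", "the", "cat", "sat", "down" };

/* next_word[i] = index of the word most likely to follow word i (made-up values) */
static const int next_word[VOCAB] = { 0, 2, 3, 4, 0 };

int main(void)
{
    int token = 1;                         /* the prompt's last word: "the" */
    printf("%s", words[token]);

    for (int step = 0; step < 10 && token != 0; step++) {
        token = next_word[token];          /* "inference": pick the next word   */
        if (token != 0)
            printf(" %s", words[token]);   /* output word-by-word, like ChatGPT */
    }
    printf("\n");                          /* prints: the cat sat down */
    return 0;
}
```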
The phases in Machine Learning
Normally when we talk about Machine Learning there are two main phases: training and inference - as described above. In an embedded system, the training will typically take place in the R&D lab. The product is hereafter released as normal - either the classic sale of a full system or the equally classic subsequent firmware update. In the field, the system is doing its job; for the AI part this means inference.
With an AI-based embedded product there are more phases than usual in a project. We have the usual project phases of understanding the problem to solve, gathering data etc. But now we also need to design our embedded AI sub-system. Ideally, this is part of the process of choosing the CPU/MCU. We will soon go deeper into this process.
Since the algorithms are no longer hand-written, it is no wonder that the current AI breakthrough is sometimes called "Software 2.0". We are starting to hear stories about mass layoffs due to AI, and similar stories about junior programmers having a hard time finding a job. AI is also creating new jobs - what will the net result be?
Advantages of AI in the embedded system
It is common in the AI world to talk about "edge" systems. These are the systems at the edge of the internet - close to where they are used. There is an overlap with the term "embedded system". Some embedded systems are not connected - or only connected some of the time - and some are so big that we sometimes forget that they fall under the embedded definition, because they are still tailored for specific purposes - like a self-driving car.
Anyway, let's look at why it is interesting to run the normally extremely resource-hungry AI in small, resource-limited systems. The main reasons are the ones described earlier: privacy, efficiency, latency, robustness and customization.
Trade-offs in an embedded system
To get started on this section, I had a short dialog with ChatGPT, asking it to give me the trade-offs when dealing with AI in an embedded system. I got the table below. The way to read it is basically: "If I put/want more/bigger of the 'thing' in the first column - what will that mean to the following parameters (columns)?" A parameter may go up/down - shown as arrows - or may not change much - shown as "=".
Fun-fact: The arrow for Portability as function of Model Size (top right corner) at first pointed upwards. ChatGPT decided to reverse it when I asked whether this was really the case. This is quite in line with my general experience. Never take AI at face value. Obviously, this may be a serious limitation in many scenarios.
| Design Factor | Power | Latency | Cost | Memory | Accuracy | Development Complexity | Portability |
|---|---|---|---|---|---|---|---|
| Model Size | ↓ with small models | ↓ with small models | ↓ (less demanding HW) | ↓ (smaller footprint) | ↓ (less expressive) | ↓ (easier to deploy) | ↓ (runs on more devices) |
| Performance | ↑ (needs more power) | ↓ (faster inference) | ↑ (requires better HW) | ↑ (larger models) | ↑ (better predictions) | ↑ (optimized toolchains) | ↓ (hardware-specific) |
| Model Complexity | ↑ (more ops/sec needed) | ↑ (slower inference) | ↑ (advanced chips) | ↑ (more params) | ↑ (can learn more) | ↑ (tuning, training) | ↓ |
| Feature Set | ↑ (complex processing) | ↑ | ↑ | ↑ | ↑ | ↑ | ↓ |
| Hardware Optimization | ↓ (if efficient HW used) | ↓ (accelerated execution) | ↑ (specialized chips) | = | = | ↑ (integration effort) | ↓ (less portable) |
| Quantization / Pruning | ↓ | ↓ | ↓ | ↓ | ↓ (small loss) | ↑ (training, tuning needed) | ↑ |
| Cloud vs On-device | ↓ on-device, ↑ cloud | ↓ on-device, ↑ cloud | ↓ cloud, ↑ edge HW | ↓ cloud, ↑ on-device | ↑ cloud (larger models) | ↑ (network, sync issues) | ↓ (if cloud dependent) |
| Security Needs | ↑ (encryption, etc.) | ↑ (overhead) | ↑ | ↑ | = | ↑ | ↓ |
The Transformer
The following is an attempt at describing the Transformer a little better than earlier - using ChatGPT as an example. This means that I refer to "words" again - but in reality, anything can be tokenized. The below is mainly based on a presentation given by Grant Sanderson from 3blue1brown at TNG's "Big Tech Day" 2024 in Munich - see Transformers (YouTube) in the links.
When you input a prompt, the sentence is split into tokens - a simple form of the words. Each token is associated with a vector of thousands of numbers from the pre-training. This so-called embedding also contains each word's position in the sentence. The collection of embeddings (associations for the full prompt) is now input to the many layers.
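A sketch of the embedding step in C - dimensions and numbers are made up (real models use vectors with thousands of learned values and a more elaborate positional encoding), but the structure is a simple table lookup plus position information:

```c
#include <stdio.h>

/* Sketch of the embedding step: each token id is looked up in a table of
   vectors, and information about its position in the sentence is added.
   DIM is tiny and all values are made up for illustration. */
#define VOCAB 4
#define DIM   3

static const float embedding_table[VOCAB][DIM] = {
    { 0.1f, 0.2f, 0.3f },   /* token 0 */
    { 0.5f, 0.1f, 0.0f },   /* token 1 */
    { 0.2f, 0.9f, 0.4f },   /* token 2 */
    { 0.7f, 0.3f, 0.8f },   /* token 3 */
};

static void embed(int token, int position, float out[DIM])
{
    for (int d = 0; d < DIM; d++) {
        /* toy positional encoding: add a small position-dependent offset */
        out[d] = embedding_table[token][d] + 0.01f * (float)position;
    }
}

int main(void)
{
    const int prompt[3] = { 2, 0, 3 };     /* three token ids from a prompt */
    float vec[DIM];

    for (int pos = 0; pos < 3; pos++) {
        embed(prompt[pos], pos, vec);
        printf("token %d at pos %d -> [%.2f %.2f %.2f]\n",
               prompt[pos], pos, vec[0], vec[1], vec[2]);
    }
    return 0;
}
```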
In the attention layers, Query matrices are generated, where nouns effectively query backwards in the sentence for adjectives. Thus "hair" may be associated with "black" and "curly" if these terms are used earlier in the sentence. Key matrices, associated with the adjectives, work together with the Query matrices on the Value matrices - which represent the information contained in the prompt.
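For the technically curious, here is a minimal C sketch of the attention idea - my own toy numbers and dimensions; a real transformer has many attention heads, much larger matrices, and Q, K and V are themselves produced by matrix multiplications. Each token's query is compared with the keys of the tokens before it, and the resulting weights decide how much of each value vector flows into the output:

```c
#include <math.h>
#include <stdio.h>

/* Minimal sketch of (scaled dot-product) attention: for token t, its query
   vector is compared (dot product) with the key vectors of tokens 0..t;
   the softmax of these scores weights the value vectors. All numbers and
   dimensions are made up for illustration. */
#define TOKENS 3
#define DIM    2

static void attend(const float Q[TOKENS][DIM],
                   const float K[TOKENS][DIM],
                   const float V[TOKENS][DIM],
                   int t,                  /* index of the token that "asks" */
                   float out[DIM])
{
    float w[TOKENS], sum = 0.0f;

    /* attention weights for token t - only looking backwards (tokens 0..t) */
    for (int j = 0; j <= t; j++) {
        float score = 0.0f;
        for (int d = 0; d < DIM; d++)
            score += Q[t][d] * K[j][d];               /* query . key           */
        w[j] = expf(score / sqrtf((float)DIM));       /* scaled, softmax below */
        sum += w[j];
    }

    /* weighted sum of the value vectors */
    for (int d = 0; d < DIM; d++) {
        out[d] = 0.0f;
        for (int j = 0; j <= t; j++)
            out[d] += (w[j] / sum) * V[j][d];
    }
}

int main(void)
{
    /* toy Q, K, V for a three-token "sentence" */
    const float Q[TOKENS][DIM] = { {1,0}, {0,1}, {1,1} };
    const float K[TOKENS][DIM] = { {1,0}, {0,1}, {1,0} };
    const float V[TOKENS][DIM] = { {1,0}, {0,1}, {5,5} };
    float out[DIM];

    attend(Q, K, V, 2, out);   /* what does the last token attend to? */
    printf("attention output for token 2: [%.2f %.2f]\n", out[0], out[1]);
    return 0;
}
```

(Compile with the math library, e.g. gcc attention.c -lm.)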
All this becomes interleaved layers of perceptrons. It sounds as if this term is also inspired by the Transformers movies, but in fact it dates back to 1958, when Rosenblatt baptized his first single layer of "neurons". Some of the perceptron layers represent the built-in knowledge from the pre-training about "everything", while others are attention layers that represent the context of the prompt (and everything that follows in a dialog). The final (output) layer is a large vector of probabilities - each representing a bid on the next word. As stated earlier, a word with high probability is selected.
If the so-called Temperature parameter is configured very low, the word with the highest probability is selected. If you turn up the Temperature, the likelihood of choosing another high-ranking word grows - or ultimately even a not-so-high-ranking word.
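In code, the temperature is simply a division of the raw scores before the softmax. The sketch below is my own toy example (made-up vocabulary and scores) showing how a low temperature makes the top word dominate, while a high temperature flattens the distribution:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of temperature-based word selection: the raw scores (logits) are
   divided by the temperature before the softmax. Low temperature -> the
   most likely word dominates; high temperature -> flatter distribution,
   so lower-ranking words are picked more often. All values are made up. */
#define VOCAB 4

static int sample(const float logits[VOCAB], float temperature)
{
    float p[VOCAB], sum = 0.0f;

    for (int i = 0; i < VOCAB; i++) {
        p[i] = expf(logits[i] / temperature);   /* softmax with temperature */
        sum += p[i];
    }

    float r = (float)rand() / (float)RAND_MAX * sum;   /* pick proportionally */
    for (int i = 0; i < VOCAB; i++) {
        r -= p[i];
        if (r <= 0.0f)
            return i;
    }
    return VOCAB - 1;
}

int main(void)
{
    const char *words[VOCAB]  = { "cat", "dog", "car", "tree" };
    const float logits[VOCAB] = { 2.0f, 1.5f, 0.5f, 0.1f };

    printf("low temperature:  %s\n", words[sample(logits, 0.1f)]);   /* almost always "cat" */
    printf("high temperature: %s\n", words[sample(logits, 2.0f)]);   /* much more random    */
    return 0;
}
```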
Now the process starts over - including the new word. And so on. You may add a sentence to this in a continued dialog.
The above is as far as I have come for now - it will be revised as I learn.

To be continued...