Embedded Tips and Tricks

The following tips and tricks span very different domains of embedded software. If you are as nerdy as I am, you may find it interesting to browse through. And who knows – you might stumble upon something that really benefits your work. In any case, I enjoy writing it.

The Fork (Assembly, Two’s Complement)

This is a trick that I learned in Robin Sharp’s class at DTU many years ago. It is only relevant if you need to save some time in an inner loop, and you are willing to write at least some instructions in assembly language.

Having worked with DSP in bit-slice and assembler, I have actually used this trick many times.

Problem: You need to know whether a number is within a certain range.

|------- Upper Limit A (Not Included) ----------- e.g. 17
| Is X in this range?
|------- Lower Limit B (Included) ---------------- e.g 10
-----------------------------------------------------> X-axis

In the register where you have X – e.g. the accumulator – do the following in the relevant assembler language:

Sub A          ; X - A
Add (A-B)      ; X - A + (A-B) = X - B
JNC OutOfRange

After each of the first two operations, the carry will be set if the sign changes.
In other words: in the above figure, crossing the X-axis gives a carry (including going from negative to 0 or vice versa).

Now, after the first two operations, the Carry is set if B <= X < A. This means that you get to do two tests with only one possible jump.
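To convince yourself that the trick works, here is a small C sketch of my own (not from the original class notes) that emulates an 8-bit accumulator. The carry after the final ADD is simply the carry-out of the unsigned 8-bit addition – exactly what JNC would test:

```c
#include <stdint.h>

/* Emulate "Sub A; Add (A-B)" on an 8-bit accumulator and return the
   carry flag after the ADD. Carry set means lo <= x < hi. */
int fork_in_range(int8_t x, int8_t lo, int8_t hi)
{
    uint8_t acc = (uint8_t)x;
    acc = (uint8_t)(acc - (uint8_t)hi);                /* Sub A     */
    unsigned sum = acc + (unsigned)(uint8_t)(hi - lo); /* Add (A-B) */
    return sum > 0xFF;                                 /* carry out */
}
```

For example, fork_in_range(13, 10, 17) gives 1, while 9, 17 and -3 give 0 – matching the desk test below.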

Desktop Test:

 X     SUB 17        ADD 7
-3     -20, Cy=0     -13, Cy=0
 9      -8, Cy=1      -1, Cy=0
10      -7, Cy=1       0, Cy=1
13      -4, Cy=1       3, Cy=1
16      -1, Cy=1       6, Cy=1
17       0, Cy=0       7, Cy=0
20       3, Cy=0      10, Cy=0

Alignment and Sign-Extension (C, Two’s Complement)

In my book Embedded Software for the IoT I discuss the so-called Q-notation for two’s complement numbers. To understand this, you first need to understand the basics of signed integers in general. This section is about a specific practical problem that demonstrates some of these basics.

Problem: Copying a number into a wider representation – e.g. 16 bits into 32.

Often you get data from an input in a different width than the CPU's typical word-width. This could e.g. be 16 or 24 bits from an A/D-converter into a 32-bit memory cell on a standard CPU.

You need to decide whether to align left – leaving unused bits at “the bottom” of the word – or align right. I prefer right-alignment because it will allow you to later do arithmetic that “overflows” into the extra bits at the left. In other words; you get to utilize the higher dynamic range.

However, if you simply copy the 16 or 24 bits into a 32-bit word, you may get problems. Let's look at a simpler version of the problem: 8 bits into 16.

The decimal value 18 – hexadecimal 0x12 in 8 bits – becomes 0x0012 in 16 bits. It’s the same – no problem.

However, -18 – hexadecimal 0xee – becomes 0x00ee. This is 238 decimal!

We need to sign-extend. The leftmost bit – the sign – must be copied into all the new bits above. Thus 0xee becomes 0xffee. If you add 0x12 to this you get 0 – as you should.

Here’s a quick way to do this:

dst = (src << 8) >> 8; // Assuming signed data vars

When shifting a signed value left, C will add zeroes at the bottom (right). When shifting right it will do an arithmetic shift, which preserves the sign-bit. The above operation first shifts the data 8 bits up – letting zeroes in at the LSB (Least Significant Bit) end. Then it shifts down again – now copying the old sign-bit into the new bits.

Strictly speaking, C does not guarantee what happens when you shift left into the sign-bit, but most compilers treat the left shift as a logical shift and the right shift as an arithmetic shift.

If you are careful, you may use casts to shift left in unsigned data and shift right in signed data.
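A sketch of that cast-based version – here for 8 bits into 16, with a hypothetical helper name. The left shift happens in unsigned data, which is well-defined; the right shift of a negative signed value is strictly implementation-defined, but arithmetic on virtually all compilers:

```c
#include <stdint.h>

/* Sign-extend the low 8 bits of src into a 16-bit signed value. */
int16_t sign_extend8(uint8_t src)
{
    uint16_t shifted = (uint16_t)src << 8; /* shift left as unsigned */
    return (int16_t)shifted >> 8;          /* shift right as signed  */
}
```

With this, sign_extend8(0xee) gives -18 and sign_extend8(0x12) gives 18, as in the example above.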

If you are even more careful and not extremely worried about a few extra cycles, you can do the following – now using 24 bits in 32-bit word-size:

if (src & 0x00800000)
   dst = src | 0xff000000;
else
   dst = src;

Or – the compact way:

dst = (src & 0x00800000) ? src | 0xff000000 : src;

Understand UTF-8 and how to deal with it (C, Unicode)

If you don’t have any textual communication with end-users in your application, you may be able to get away with the old ASCII characters in 8 bits, supported in C as char. In almost every other case, you need to learn about UTF-8 as it is the way the world is going. Even Microsoft has accepted UTF-8 as the standard of the future. The path to this unity has been long. You may find all sorts of advice in semi-new articles, but beware!

Problem: C is not created for Unicode, but we need new and old programs to work in an international environment.

The major downside of UTF-8 is that there is not a simple formula to go from number of characters to bytes or the other way. Thus you cannot easily index into a string – or subtract one pointer from another to find the number of characters between these.

The upside is that most existing C-functions actually still work. UTF-8 characters take up 1 to 4 (in principle 6) bytes of space. The first 128 characters (0-127) of the ASCII alphabet are passed “as is” in a single UTF-8 byte. Equally important: each of the extra bytes added to encode other languages has the most significant bit set. These two facts mean that existing code searching for zero-termination, “/” or other control characters still works.
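The continuation-byte property also gives a simple way to count characters (as opposed to bytes) in a UTF-8 string. Here is a sketch of my own – not a standard library function – that simply skips all bytes of the form 10xxxxxx:

```c
#include <stddef.h>

/* Count UTF-8 characters by skipping continuation bytes (10xxxxxx). */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80) /* not a continuation byte */
            n++;
    return n;
}
```

With this, utf8_strlen("æøå") returns 3, although the string occupies 6 bytes plus termination.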

UTF-8 “multibytes” can be converted into “widechars” of fixed 32 bits for internal use in a program. You typically use the built-in type wchar_t for this, but unfortunately some platforms only allocate 16 bits for this – so beware.

In most cases the C standard library will help you and your program support the relevant locale in UTF-8. Read e.g. Markus Kuhn’s intro to UTF-8.

Exploring the UTF-8 thing (WSL with classic tools)

I wanted to test Windows Subsystem for Linux. The installer halted during the installation of the default Ubuntu, so instead I installed it with Debian Linux. I used it for the following exploration of UTF-8, using basic Linux tools.

A short program with the Danish characters – æ, ø and å is written. The (two) UTF-8 bytes for these are given below – in hex, decimal and octal, to comply with the tools used:

UTF-8: Hex     Dec       Oct
æ:    C3 A6   195 166   303 246
ø:    C3 B8   195 184   303 270
å:    C3 A5   195 165   303 245

The Program and gcc

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char* argv[])
{
   if (!setlocale(LC_CTYPE, ""))
      fprintf(stderr, "Problems setting locale! Check LANG, LC_CTYPE, LC_ALL.\n");

   char classic[] = "æøå";
   wchar_t wide[] = L"æøå";

   printf("Classic æøå string: %s\n", classic);
   printf("Wide æøå string: %ls\n", wide);

   printf("Size of Classic: %zu, Wide: %zu\n", sizeof(classic), sizeof(wide));

   for (int i = 0; i < 3; i++)
      printf("classic[%d] = %c, wide[%d] = %lc\n", i, classic[i], i, wide[i]);

   return 0;
}
To provide debug-info, this is compiled in gcc with the “-g” option:

$ gcc myprog.c -o myprog.exe -g

When run, the program gives the following output:

Classic æøå string: æøå
Wide æøå string: æøå
Size of Classic: 7, Wide: 16
classic[0] = , wide[0] = æ
classic[1] = , wide[1] = ø
classic[2] = , wide[2] = å

Note that the string “æøå” is printed correctly in all three cases: from within the printf format string, from the “classic” char-string and from the wide-char string. The wide-char string takes up 16 bytes – 4 bytes for each of æøå and ‘\0’ – no surprise there. However, more interesting:  the “classic” string fills up 7 bytes: 2 per æøå, and one for the ‘\0’ termination.

Thus the wide-char string is internally handled as 32-bit chars – as expected – but the “char” string is actually stored as UTF-8 chars! This shows when we try to index into the strings – it only works with the wide-char string.

Except for the indexing, it all works behind the scenes in a very normal-looking code.

GDB, strace and od

The UTF-8 layout of the char-array is confirmed when I break in GDB, and the bytes in the “classic” string are:

0x7ffffffee2f9: -61 -90 -61 -72 -61 -91 0

The above is given as negative numbers. If you add 256 to each you will see that it fits with the three UTF-8 chars in the table above – and a ‘\0’ termination.

Now for strace. This is a nice tool that shows the system calls a program makes during execution. It works flawlessly in WSL – which says something about the quality of the port.

Here we mainly see the same pattern – just a bit confusing because octal notation is used. We see that when indexing into the “classic” string we get the first three bytes of the dump above – if we add 256 and convert to octal 🙂

Note that in the write calls – the result of the printf – we see that the library at this point has substituted the 32-bit chars in the wide array with the corresponding UTF-8 chars.

$ strace ./myprog.exe
.... skipping lines ...
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1679776, ...}) = 0
mmap(NULL, 1679776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa86b855000
close(3) = 0
fstat(1, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 1), ...}) = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "Classic \303\246\303\270\303\245 string: \303\246\303\270\303\245\n", 30Classic æøå string: æøå
) = 30
open("/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26258, ...}) = 0
mmap(NULL, 26258, PROT_READ, MAP_SHARED, 3, 0) = 0x7fa86b9ff000
close(3) = 0
write(1, "Wide \303\246\303\270\303\245 string: \303\246\303\270\303\245\n", 27Wide æøå string: æøå
) = 27
write(1, "Size of Classic: 7, Wide: 16\n", 29Size of Classic: 7, Wide: 16
) = 29
write(1, "classic[0] = \303, wide[0] = \303\246\n", 29classic[0] = , wide[0] = æ
) = 29
write(1, "classic[1] = \246, wide[1] = \303\270\n", 29classic[1] = , wide[1] = ø
) = 29
write(1, "classic[2] = \303, wide[2] = \303\245\n", 29classic[2] = , wide[2] = å
) = 29
exit_group(0) = ?
+++ exited with 0 +++

If we do NOT include the setlocale call in the above program, the classic string does not change behaviour. However, the wide string does not print out as it should.

Finally we look at what the program actually output. Here I pipe the output to a file and then dump it – with ASCII chars whenever possible, and octal values for the rest. Clearly our strings end up as UTF-8. Again we see how indexing into the classic string using byte-sized chars gives us incomplete UTF-8 chars.

$./myprog.exe > out.txt
$od -c out.txt
0000000 C l a s s i c 303 246 303 270 303 245 s
0000020 t r i n g : 303 246 303 270 303 245 \n W i
0000040 d e 303 246 303 270 303 245 s t r i n g
0000060 : 303 246 303 270 303 245 \n S i z e o f
0000100 C l a s s i c : 7 , W i d
0000120 e : 1 6 \n c l a s s i c [ 0 ]
0000140 = 303 , w i d e [ 0 ] =
0000160 303 246 \n c l a s s i c [ 1 ] =
0000200 246 , w i d e [ 1 ] = 303 270 \n
0000220 c l a s s i c [ 2 ] = 303 ,
0000240 w i d e [ 2 ] = 303 245 \n

Back to WSL: All the nice tools work as if on Linux, and yet I can edit “from the Windows side” – very nice. WSL stores the Linux tree deep down in the folder structure, but you can easily work on files in the “normal” Windows file system. Start the path with “/mnt/c” and then business as usual with “/” instead of “\”.

RTC (Architecture, Scheduling)

When we say “RTC” we normally mean “Real-Time Clock”. But the acronym also means “Run To Completion”. In my book Embedded Software for the IoT, I mainly discuss larger operating systems like e.g. Linux, Windows NT and FreeRTOS. There are, however, scenarios where you can make do with something much leaner. If your microprocessor needs to do a select few – but important – jobs, and you don’t need the drivers and abundant catalog of software that comes with Linux, a small scheduler can be your solution.

The larger OS’es support preemptive processing, where the kernel can remove any process from the CPU at any time. As opposed to this, “Run To Completion” means that once the code is started, it will run until done. It might still be preempted by something of a higher priority, but it will not be relieved of its duties because its time-slice is up.

Where the larger OS’es have threads and/or processes that may block, but typically run in infinite loops, the RTC-concept will start small jobs that run until they are done. This can be repeated ad-infinitum, but it’s still a different way of thinking than the infinite loop.

With RTC there are still priorities. The lowest priority is the normal execution context. Higher priorities run at increasingly higher prioritized interrupts – starting from the low end, so that “real” interrupts from hardware always take precedence. Obviously, we here abandon the normal “guerilla” concept for interrupts, where we do as little as possible at interrupt context, and defer the rest until later.

Typically a job can be scheduled from any priority to be run at any priority by a function call to the scheduler. The scheduler attaches the job to a queue that is specific to the given priority. It then sets the relevant interrupt-flag (unless it’s the lowest priority). If we are already executing at a higher priority nothing more happens now, but when this is no longer the case, the interrupt occurs. In the interrupt routine the oldest job for the corresponding priority is now executed.
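To make the idea concrete, here is a minimal host-side sketch of my own (with hypothetical names). On a real target, schedule() would also set the software-interrupt flag for the given priority; here a simple loop plays the part of the interrupt hardware, always draining the oldest job from the highest non-empty priority queue first:

```c
#include <stddef.h>
#include <string.h>

#define PRIOS 3
#define QLEN  8

typedef void (*job_t)(void);

static job_t queue[PRIOS][QLEN];
static int head[PRIOS], tail[PRIOS];

/* Schedule a job at priority prio (0 = lowest).
   Returns -1 if the queue for that priority is full. */
static int schedule(int prio, job_t job)
{
    int next = (tail[prio] + 1) % QLEN;
    if (next == head[prio])
        return -1;                  /* queue full */
    queue[prio][tail[prio]] = job;
    tail[prio] = next;
    return 0;
}

/* Host-side stand-in for the interrupt mechanism: repeatedly run the
   oldest job from the highest non-empty priority, to completion. */
static void run_scheduler(void)
{
    for (;;) {
        int p = PRIOS - 1;
        while (p >= 0 && head[p] == tail[p])
            p--;
        if (p < 0)
            return;                 /* all queues empty */
        job_t job = queue[p][head[p]];
        head[p] = (head[p] + 1) % QLEN;
        job();                      /* run to completion */
    }
}

/* Demo jobs: each appends a letter to a log. */
static char joblog[16];
static void low_job(void)  { strcat(joblog, "L"); }
static void high_job(void) { strcat(joblog, "H"); }
```

Scheduling two low_job's and one high_job and then running the scheduler yields the log "HLL" – the high-priority job jumps the queue even though it was scheduled last.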

A “semi-RTC” OS (my term) will not necessarily demand that jobs run to completion – but it is still non-preemptive. Such a system can “yield” the CPU when “convenient”. This is what the original Windows OS did, and it allows for more classic infinite loops. As far as I remember, any OS call could lead to a yield. I recall inserting Sleep(0) at various places in lengthy code to allow the user interface (e.g. the mouse) to appear more alive.

There are several drawbacks with RTC:

  • A lot of code will be executed at interrupt context.
  • It is not easy to use open-source code – or most other larger libraries.
  • Access rights etc are not possible as long as “everybody can schedule everybody”. In a small embedded system this is typically not an issue.

There are also important advantages:

  • The scheduler is extremely lean & mean – using very little CPU and very little memory. It only runs when needed – very event-based.
  • Timing constraints are easy to understand. The highest priority can basically keep the CPU as long as needed. What’s left is available to the next-highest, etc.
  • Since any preemption is done by interrupts, we can utilize the single CPU stack – no need for a stack per thread/process.

Since you can also schedule jobs from a timer, you can implement timeouts. Thus you could e.g. implement a full TCP/IP stack with RTC – although it would be a lot of work.