Operating Systems and their Challenges

This page is dedicated to OS-near subjects.


Understanding UTF-8 and how to deal with it (C, Unicode)

If you don’t have any textual communication with end users in your application, you may be able to get away with the old 8-bit ASCII characters, supported in C as char. In almost every other case you need to learn about UTF-8, as it is the way the world is going. Even Microsoft has accepted UTF-8 as the standard of the future. The path to this unity has been long, and you may find all sorts of conflicting advice in semi-new articles – so beware!

Problem: C is not created for Unicode, but we need new and old programs to work in an international environment.

The major downside of UTF-8 is that there is not a simple formula to go from number of characters to bytes or the other way. Thus you cannot easily index into a string – or subtract one pointer from another to find the number of characters between these.

The upside is that most existing C functions actually still work. UTF-8 characters take up 1 to 4 (in principle 6) bytes of space. The first 128 characters – the original ASCII set – are passed “as is” in a single UTF-8 byte. Equally important: each of the extra bytes used to encode other languages has its most significant bit set. These two facts mean that existing code searching for zero-termination, “/” or other special characters still works.

UTF-8 “multibytes” can be converted into “widechars” of fixed 32 bits for internal use in a program. You typically use the built-in type wchar_t for this, but unfortunately some platforms only allocate 16 bits for this – so beware.

In most cases the C standard library will help you and your program support the relevant locale in UTF-8. Read e.g. Markus Kuhn’s intro to UTF-8.

Exploring the UTF-8 thing (WSL with classic tools)

I wanted to test the Windows Subsystem for Linux. The installer halted during the installation of the default Ubuntu distribution, so I installed Debian Linux instead. I used it for the following exploration of UTF-8, with basic Linux tools.

A short program with the Danish characters – æ, ø and å – is written. The (two) UTF-8 bytes for each of these are given below – in hex, decimal and octal, to match the tools used:

UTF-8: Hex     Dec       Oct
æ:    C3 A6   195 166   303 246
ø:    C3 B8   195 184   303 270
å:    C3 A5   195 165   303 245

The Program and gcc

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char* argv[])
{
   if (!setlocale(LC_CTYPE, ""))
      fprintf(stderr, "Problems setting locale! Check LANG, LC_CTYPE, LC_ALL.\n");

   char classic[] = "æøå";
   wchar_t wide[] = L"æøå";

   printf("Classic æøå string: %s\n", classic);
   printf("Wide æøå string: %ls\n", wide);

   printf("Size of Classic: %zu, Wide: %zu\n", sizeof(classic), sizeof(wide));

   for (int i = 0; i < 3; i++)
      printf("classic[%d] = %c, wide[%d] = %lc\n", i, classic[i], i, wide[i]);

   return 0;
}

To provide debug-info, this is compiled in gcc with the “-g” option:

$ gcc myprog.c -o myprog.exe -g

When run, the program gives the following output:

Classic æøå string: æøå
Wide æøå string: æøå
Size of Classic: 7, Wide: 16
classic[0] = , wide[0] = æ
classic[1] = , wide[1] = ø
classic[2] = , wide[2] = å

Note that the string “æøå” is printed correctly in all three cases: from within the printf format string, from the “classic” char string and from the wide-char string. The wide-char string takes up 16 bytes – 4 bytes for each of æ, ø, å and ‘\0’ – no surprise there. More interesting: the “classic” string fills 7 bytes – two for each of æ, ø and å, and one for the ‘\0’ termination.

Thus the wide-char string is internally handled as 32-bit chars – as expected – but the “char” string is actually stored as UTF-8 chars! This shows when we try to index into the strings – it only works with the wide-char string.

Except for the indexing, it all works behind the scenes in very normal-looking code.

GDB, strace and od (GNU tools)

The UTF-8 layout of the char array is confirmed when I break in GDB and inspect the bytes in the “classic” string:

0x7ffffffee2f9: -61 -90 -61 -72 -61 -91 0

The bytes are shown as negative numbers because char is signed here. If you add 256 to each, you will see that they fit the three UTF-8 characters in the table above – plus a ‘\0’ termination.

Now for strace. This is a nice tool that shows the system calls a program makes during execution. It works flawlessly in WSL, which says something about the quality of the implementation.

Here we mainly see the same pattern – just a bit confusing because octal notation is used. We see that when indexing into the “classic” string we get the first three bytes of the dump above – if we add 256 and convert to octal 🙂

Note that in the write calls – the result of the printf calls – the library has at this point substituted the 32-bit chars in the wide array with the corresponding UTF-8 bytes.

$ strace ./myprog.exe
.... skipping lines ...
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1679776, ...}) = 0
mmap(NULL, 1679776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa86b855000
close(3) = 0
fstat(1, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 1), ...}) = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "Classic \303\246\303\270\303\245 string: \303\246\303\270\303\245\n", 30Classic æøå string: æøå
) = 30
open("/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26258, ...}) = 0
mmap(NULL, 26258, PROT_READ, MAP_SHARED, 3, 0) = 0x7fa86b9ff000
close(3) = 0
write(1, "Wide \303\246\303\270\303\245 string: \303\246\303\270\303\245\n", 27Wide æøå string: æøå
) = 27
write(1, "Size of Classic: 7, Wide: 16\n", 29Size of Classic: 7, Wide: 16
) = 29
write(1, "classic[0] = \303, wide[0] = \303\246\n", 29classic[0] = , wide[0] = æ
) = 29
write(1, "classic[1] = \246, wide[1] = \303\270\n", 29classic[1] = , wide[1] = ø
) = 29
write(1, "classic[2] = \303, wide[2] = \303\245\n", 29classic[2] = , wide[2] = å
) = 29
exit_group(0) = ?
+++ exited with 0 +++

If we do NOT include the setlocale call in the above program, the classic string does not change behavior. However, the wide string no longer prints as it should.

Output to File

Finally we look at what the program actually did output. Here I pipe the output to a file and then dump it – with ASCII chars whenever possible, and octal values for the rest. Clearly everything ends up as UTF-8 bytes. Again we see how indexing into the classic string with byte-sized chars gives us incomplete UTF-8 characters.

$./myprog.exe > out.txt
$od -c out.txt
0000000 C l a s s i c 303 246 303 270 303 245 s
0000020 t r i n g : 303 246 303 270 303 245 \n W i
0000040 d e 303 246 303 270 303 245 s t r i n g
0000060 : 303 246 303 270 303 245 \n S i z e o f
0000100 C l a s s i c : 7 , W i d
0000120 e : 1 6 \n c l a s s i c [ 0 ]
0000140 = 303 , w i d e [ 0 ] =
0000160 303 246 \n c l a s s i c [ 1 ] =
0000200 246 , w i d e [ 1 ] = 303 270 \n
0000220 c l a s s i c [ 2 ] = 303 ,
0000240 w i d e [ 2 ] = 303 245 \n

Back to WSL: all the nice tools work as if on Linux, and yet I can edit “from the Windows side” – very nice. WSL stores the Linux tree deep down in the Windows folder structure, but you can easily work on files in the “normal” Windows file system: start the path with “/mnt/c” and then business as usual with “/” instead of “\”.

RTC (Architecture, Scheduling)

When we say “RTC” we normally mean “Real-Time Clock”. But the acronym also means “Run To Completion”. In my book Embedded Software for the IoT, I mainly discuss larger operating systems like Linux and Windows NT, and touch on smaller RTOS’es like FreeRTOS (also used in the newer book Microcontrollers with C). There are, however, scenarios where you can do with something much leaner. If your microprocessor needs to do a select few – but important – jobs, and you don’t need the drivers and abundant catalog of software that comes with Linux, a small scheduler can be your solution.

The larger OS’es support preemptive processing where the kernel can remove any process from the CPU at any time. Opposed to this concept, “Run To Completion” means that once the code is started, it will run until done. It might still be preempted by something of a higher priority, but it will not be relieved of its duties because its time-slice is up.

Where the larger OS’es have threads and/or processes that may block, but typically run in infinite loops, the RTC concept starts small jobs that run until they are done. This can be repeated ad infinitum, but it’s still a different way of thinking than the infinite loop.

With RTC there are still priorities. The lowest priority is the normal execution context. Higher priorities run at increasingly higher prioritized interrupts – starting from the low end, so that “real” interrupts from hardware always take precedence. Obviously, we here abandon the normal “guerrilla” concept for interrupts, where we do as little as possible in interrupt context and defer the rest until later.

Typically a job can be scheduled from any priority to be run at any priority by a function call to the scheduler. The scheduler attaches the job to a queue that is specific to the given priority. It then sets the relevant interrupt-flag (unless it’s the lowest priority). If we are already executing at a higher priority nothing more happens now, but when this is no longer the case, the interrupt occurs. In the interrupt routine the oldest job for the corresponding priority is now executed.

A “semi-RTC” OS (my term) will not necessarily demand that jobs run to completion – but it is still non-preemptive. Such a system can “yield” the CPU when “convenient”. This is what the original Windows OS did, and it allows for more classic infinite loops. Any OS-call could lead to a yield as far as I remember. I remember inserting Sleep(0) at various places in lengthy code, to allow the user-interface (e.g. mouse) to appear more alive.

There are several drawbacks with RTC:

  • A lot of code will be executed at interrupt context.
  • It is not easy to use open-source code – or most other larger libraries.
  • Access rights etc are not possible as long as “everybody can schedule everybody”. In a small embedded system this is typically not an issue.

There are also important advantages:

  • The scheduler is extremely lean & mean – using very little CPU and very little memory. It only runs when needed – very event-based.
  • Timing constraints are easy to understand. The highest priority can basically keep the CPU as long as needed. What’s left is available to the next-highest, etc.
  • Since any preemption is done by interrupts, we can utilize the single CPU stack – no need for a stack per thread/process.

Since you can also schedule jobs from a timer, you can implement timeouts. Thus you could e.g. implement a full TCP/IP stack with RTC – although it would be a lot of work.


Memory Layout (Linker, GCC)

Many tool-chains – including GCC – will create a memory layout as below:

|           |
|           | High Address 
| stack     |
|           | Stack grows downwards 
| RAM       |
|           |
|           |               
| heap      | heap grows upwards 
|           |
| RAM       |
| "bss"     |
|           |
| un-init   |
| vars      |
| RAM       |
| "data"    |
| init'ed   |
| vars      |
|           |
| RAM       |
| "text"    |
| code      |
| constants | Low Address
| Flash     | 
| or RAM    |

Use the “size” command to see the size of the Text, Data and BSS segments – and their sum in decimal and hex.

$ size myprog.exe
text data bss dec hex filename
2272 616 16 2904 b58 myprog.exe

The BSS, Data and Text segments are statically structured by the linker & loader. This means that even before the program is run, you can see in the map-file where which variable is stored and how many bytes it occupies.

The content that is not statically allocated by the linker is found in the heap and the stack.

The heap is for memory dynamically allocated with “malloc” in C and with “new” in C++. In most systems the heap grows upwards (incrementing addresses).

Stack checks

The stack is used for local variables in functions – unless they are declared “static”. In most systems the stack grows downwards.

It is clear that there is a risk that the stack grows down into the heap or the heap grows up into the stack. This will look like random memory overwrites and can be difficult to find. With the stack affected, the CPU will likely execute “bad code” and run into a halt.

A static call-tree analysis may tell you the max stack depth of a normal set of calls. However, you must be prepared to add more for interrupts, which may be nested. Many OS’es have mechanisms that will warn you when the stack overflows. If your OS does not have such a mechanism, you can make one yourself. Declare a variable (or maybe a small array) and assign it to the segment that your system is using for the heap:

static __attribute__((section ("._seg_heap"))) uint32_t __stackheapguard = 0xDEADFACE;

Now check regularly that this variable is not changed – e.g. in a low-priority task. During development an assert can be used.

While you are at it, you might also want to enable a check – during debug – by setting the compiler flag “-fstack-protector-strong”. This causes the compiler to insert a canary check in every function that uses local arrays or buffers – catching classic “buffer overruns” and other errors that lead to out-of-bounds writes.

Handling the heap

While the stack sort of “cleans up after itself”, the heap needs “free” – for C – or “delete” – in C++ – to be called at the right places. If one of these is forgotten, we have a memory leak. After some time of constant leaking, the program will likely crash. Even if allocation and deallocation is correctly matched, you may run into a “fragmented” memory. This can also lead to crashes, when the program attempts a malloc of a size that does not exist as one piece of contiguous memory.

The best cure for dynamic memory allocation problems is not to use dynamic memory allocation! Instead, allocate everything statically. This is not always easy, so you may want to use pools of fixed-size buffers instead. This fixes the fragmentation problem. The code may monitor how many buffers are free within each pool. Again, in debug builds it may assert if a pool is running low. Knowing which parts of the code use which buffer sizes will help find the villain.

Code for larger OS’es like Linux does use dynamic memory. This is basically something we need to accept.

Memory Manager

The previous chapter is especially relevant for smaller microprocessors running a single exe or bin with no memory manager. If the system does have a memory manager, it can help catch things like the stack writing into the heap.

A memory-manager is a prerequisite for Virtual Memory. If you have a virtual-memory based system, the linker map does not directly show the static address of variables in the Text, BSS and Data segments. If the system allows the load of several exe-files (like Linux) there will be equally many map-files, and the whole thing becomes more complex. However, because of the help you get from the memory-manager you probably don’t need to follow variables through map-files and their final addresses.

Fast and slow RAM

We cannot leave the subject of memory without touching upon the more physical division of memory. Some CPUs offer parts of memory that are faster than others – and we are not talking about caches here. These systems allow you to place specific parts of your data (and code) in the faster RAM.