Collected ideas for Performance
The following is a list of performance Ideas. Some stem from one of my favorite books "Code Complete 2" by Steve McConnell:
- There are times where optimizing code makes a lot of sense. Optimizing algorithms, however, leads to gains in factor - not just percentage. Study the litterature. If you like to mainipulate bits and bytes, I can recommend "Hackers Delight" by Henry S. Warren Jr.
- Know your main compiler/linker optimizer choices (check with profiler and/or disassemble).
- If your CPU/MCU has cache(s) - it very often pays to optimize for size. This leads to fewer cache misses, becoming faster than optimizing for speed.
- When iterating through multi-dimensional arrarys, keep the rightmost index in the inner-loop.
- Iterate through arrays using pointers - when it makes a difference.
- Pass large objects to/from functions by pointer to avoid copying.
- Replace if-ladder with switch and replace switch with table (quality AND speed).
- Know and use available intrinsic functions. Examples of core functions and CMSIS lib on ARM: __ISB(), __REV(), __BKPT(), __WFI(). (Instruction-Synch Barrier, Reverse Byte order, Breakpoint, Wait-for-Interrupt (suspend power))
- Keep no of interrupts down - use HW FIFOs and DMA for I/O if possible.
- Save on critical function calls (but do not make code unreadable) - e.g., if a loop calls the same function many times - consider moving the loop into the function.
- If interrupt calls an "extern" function, all registers will be stacked - just in case. Avoid if possible.
- Divide by float/double constants by multiplying with the reciprocal - doing the division compile-time. GCC automatically does this with -Ofast, -ffast-math, -funsafe-math-optimizations, -freciprocal-math - BUT - dare we use “unsafe-math-operations”?
- Design structs with decreasing size of elements to preserve memory.
- Prefer float over double when range & accuracy allows. Consider int - not blindly! Note that Cortex-M4 may have FPU - but only for float - not double.
- Do NOT use char, uint8_t, int16_t etc instead of int to be faster. The CPU may spend a lot of time isolating that byte.
- When using floating-point - be sure to enable the FPU if available.
Other Guidelines for Embedded Systems (especially C-based)
Here are some more guidelines - several from "Code Complete 2" - not so much about performance:
- Study the datasheet & block diagram - understand the hardware’s functionality.
- Understand two’s-complement & hex-notation – and beware of C’s octal notation.
- Understand the concept of interrupts & threads.
- Use state-machines instead of multiple state booleans.
- Treat warnings as errors and suppress warnings with local pragmas when accepted.
- Don’t wait too long before testing on real hardware.
- Design error-handling.
- Decide early on internationalization (in short: i18n) or not.
- Use function headers (Doxygen), comments and whitespace.
- Understand endian-ness.
- Use the typesafe const instead of #define when possible.
If not possible then surround constants with parentheses. Say you write the following:
#define BACKOFF_IN_MS 1000*5+100 // 5s + 100 ms
You end up with 5500 ms - should have been 25500 ms!int wait_five_backoffs = BACKOFF_IN_MS * 5; - Use enum instead of multiple consts or defines
- Use meaningful names - include units like: "loadmA".
- Use the most narrow scope possible.
- Don't reuse variables.
- Limit use of "magic" numbers. Let's agree that 0 is not magic.
- Avoid empty "if" (with action in else).
- Use boolean functions in if-chains to preserve overview.
- Do use "short-circuit" evaluation to stepwise calculate conditions needed. Like:
if ((myptr != 0) && (myptr->subs != 0) && (myptr->subs->legal)) - Use DeMorgan to test what is most natural in the given context.
!(a && b) same as !a || !b!(a || b) same as !a && !b - Always have default in switch/case, but do not use the default for last case in set - like e.g., enum red, yellow, green gets cases red, yellow, default.
- When doing loops, keep control at start and end.
- When using for-loops, don't meddle with the loop-index halfway through.
NASA's Power of Ten
The following is a list on "NASA's Power of Ten" Guidelines for C-programs I found somewhere. I don't know whether it's genuine - but it's interesting:
- Simple control flow. No jumps or gotos. No recursiveness.
- All loops bound by an integer max. E.g. linked-list search.
- No use of heap - thus no malloc.
- Limit Function size. Max 60 lines – one paper. Single purpose.
- Use a minimum of two runtime asserts per function.
- Data Hiding. Declare variables at lowest scope.
- Check return values. If don’t care (e.g. printf) – signal by casting to void.
- Limit use of preprocessor. No ifdefs to various build-targets.
- Restrict the use of pointers to one level of dereferencing.
- Compile with –Wall –Werror – Wpedantic.
Here's an old NASA motto I picked up years ago:
Test what you fly - fly what you test.
Trick: The Fork (Assembly, Two's Complement)
This is a trick that I learned in Robin Sharp's class at DTU many years ago. It is only relevant if you need to save some time in an inner loop, and you are willing to write at least some instructions in assembly language.
Having worked with DSP in bit-slice and assembler, I have actually used this trick many times.
Problem: You need to know whether a number is within a certain range.
y-axis
|
|
|------- Upper Limit A (Not Included) ----------- e.g. 17
|
|
| Is X in this range?
|
|------- Lower Limit B (Included) ---------------- e.g 10
|
|
|
-----------------------------------------------------> X-axis
In the register where you have X - e.g. the accumulator - do the following in the relevant assembler language:
Sub A
Add (A-B)
JNC OutOfRange
After each of the first two operations, the carry will be set if the sign changes. In other words, in the above figure, crossing the X-axis gives a carry (incl going from negative to 0 or vice-versa).
Now, after the first two operations, the Carry is set if B <= X < A. This means that you get to do two tests with only one possible jump.
Desktop Test:
| X | SUB 17 | ADD 7 |
|---|---|---|
| -3 | -20,Cy=0 | -13,Cy=0 |
| 9 | -8, Cy=1 | -1, Cy=0 |
| 10 | -7, Cy=1 | 0, Cy=1 |
| 13 | -4, Cy=1 | 3, Cy=1 |
| 16 | -1, Cy=1 | 6, Cy=1 |
| 17 | 0, Cy=0 | 7, Cy=0 |
| 20 | 3, Cy=0 | 10, Cy=0 |