SIMD


The dead horse that I like to keep beating - everyone should use SIMD (single-instruction-multiple data) in their software wherever possible. It's basically free performance with no down side. Every PC (x86 machine) sold in the last 10 years has it and so do mobile/embedded machines for at least the last 5 years. What is SIMD? It's a set of powerful instructions which make use of extra-wide registers (typically 128-bits) and can do multiple operations in parallel. For example, a regular CPU instruction can perform a single math operation (e.g. integer addition), while a SIMD instruction can do 2, 4, 8, or 16 separate additions in parallel in the same amount of time. It means your program can do its work many times faster. Many programmers are already aware of the existence of SIMD, but assume that by enabling the "auto-vectorization" option of their C compiler, their program will magically contain beautifully crafted SIMD code. You can already guess from my last sentence that this is not always the case.

SIMD Challenges

  • Most software never makes use of these instructions because it takes extra effort to enable/add them to your code
  • Your algorithm needs to allow for operations to occur in parallel. Imaging/pixels are usually good candidates for SIMD optimization
  • Compilers are not very good at using SIMD automatically. This means you'll probably need to add intrinsics to your code to explicitly tell the compiler your intent
  • Each platform has its own unique SIMD instructions. There is great overlap between systems, but custom code must be written to take advantage of unique features in each platform.

With exactly the right conditions and a little help, some C compilers can vectorize some loops. What happens when the loop isn't so simple? The C compiler gives up and doesn't use SIMD. What is a simple loop? Here's an example of some C code that a modern C compiler can successfully vectorize into SIMD code. The code is arranged in nice multiples of the SIMD register size without anything special happening in the loop:




When the loop has a conditional statement in it, the compiler will give up trying to vectorize it. This is because in SIMD, the same operation is applied to all of the elements in each register. In the case of 128-bit SIMD, four 32-bit integers can be multiplied at the same time. If we add a condition such that an individual uint32_t might not get the same treatment, the compiler gives up on trying to optimize the code with SIMD. Here's an example:













SIMD instructions do include conditional operations, but they're not meant to be used for branching, instead, the resulting condition becomes a vector of flag bits (all 1's indicate true, all 0's indicate false). Maybe some day compilers will be smart enough to vectorize such loops, but I haven't seen any yet that can do it. With a little more effort (making using of SIMD intrinsics), it is possible to optimize such a loop to get better performance. Here's how to do it for the example above:


In the code above, I've written the conditional loop using Intel SSE intrinsics. The code would look virtually identical using other SIMD instruction sets (e.g. ARM NEON, Altavec, DSP). So the point of all of this - don't be afraid to write SIMD instructions in your programs to squeeze the maximum performance out of your hardware and don't assume your compiler is smart about using SIMD.

Comments

Popular posts from this blog

Fast SSD1306 OLED drawing with I2C bit banging

My adventures in writing an OTA bootloader for the ATmega128RFA1

Practical animation on I2C SSD1306 displays