SIMD
The dead horse that I like to keep beating: everyone should use SIMD (single instruction, multiple data) in their software wherever possible. It's essentially free performance, so why not use it? Every PC (x86 machine) sold in the last 10 years has it, and so do mobile/embedded machines from at least the last 5 years.

What is SIMD? It's a set of instructions that usually make use of extra-wide registers (typically 128 bits) and can do multiple operations in parallel. A regular CPU instruction performs a single math operation (e.g. one integer addition), while a SIMD instruction can do 2, 4, 8, or 16 separate additions in parallel in the same amount of time. That means your program can get its work done many times faster. Many programmers are already aware of the existence of SIMD, but may assume that by enabling the "auto-vectorization" option of their C compiler, their program will magically contain beautifully crafted SIMD code. You can already guess from that last sentence that this is not always the case.
SIMD Challenges
- Most software never makes use of these instructions because it takes extra effort to add/maintain them in your code
- Your algorithm needs to allow operations to occur in parallel; imaging/pixel code is usually a good candidate for SIMD optimization
- Compilers are not very good at using SIMD automatically. This means you'll probably need to add intrinsics to your code to explicitly tell the compiler your intent
- Each platform has its own unique SIMD instructions. There is great overlap between systems, but custom code must be written to take advantage of unique features in each platform.
With exactly the right conditions and a little help, some C compilers can vectorize some loops. What happens when the loop isn't so simple? The C compiler gives up and doesn't use SIMD. What is a simple loop? Here's an example of some C code that a modern C compiler can successfully vectorize into SIMD code. The code is arranged in nice multiples of the SIMD register size without anything special happening in the loop:
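A minimal sketch of such a loop (the function name and the assumption that count is a multiple of the four 32-bit lanes in a 128-bit register are illustrative):

```c
#include <stdint.h>

// Each iteration is independent and does the same work, and the trip count
// is assumed to be a multiple of the vector width, so a modern compiler
// can turn this loop into SIMD code on its own.
void brighten(uint32_t *pixels, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        pixels[i] *= 2; // the same operation applied to every element
    }
}
```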
When the loop has a conditional statement in it, most compilers will not be able to vectorize it. This is because in SIMD, the same operation is applied to all of the elements in each register. In the case of 128-bit SIMD, four 32-bit integers can be multiplied at the same time. If we add a condition such that an individual uint32_t might not get the same treatment, the compiler gives up on trying to optimize the code with SIMD. Here's an example:
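A sketch of the same loop with a per-element condition added (again illustrative): each pixel is only doubled when it exceeds a threshold, so the iterations no longer all do the same thing.

```c
#include <stdint.h>

// The branch inside the loop means each element may take a different
// path, which is enough to make most compilers fall back to scalar code.
void brighten_above(uint32_t *pixels, int count, uint32_t threshold)
{
    int i;

    for (i = 0; i < count; i++) {
        if (pixels[i] > threshold) // this condition defeats auto-vectorization
            pixels[i] *= 2;
    }
}
```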
SIMD instructions do include conditional operations, but they're not meant to be used for branching; instead, the result of a comparison becomes a vector of flag bits (all 1's indicate true, all 0's indicate false) that can be used to blend the two outcomes together. Maybe someday compilers will be smart enough to vectorize such loops, but I haven't seen one yet that can do it. With a little more effort (making use of SIMD intrinsics), it is possible to optimize such a loop to get better performance. Here's how to do it for the example above:
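A minimal SSE2 sketch of that technique (a reconstruction under the assumptions of the conditional loop above, with count a multiple of 4): the comparison produces a lane mask, and bitwise operations select the doubled value where the mask is set and the original value elsewhere.

```c
#include <stdint.h>
#include <emmintrin.h> // SSE2 intrinsics

// Branchless version of brighten_above(); processes 4 pixels per iteration
// and assumes count is a multiple of 4 (a real version would finish any
// remainder with a scalar loop).
void brighten_above_sse(uint32_t *pixels, int count, uint32_t threshold)
{
    // SSE2 only has a signed 32-bit compare; flipping the sign bit of both
    // sides turns it into an unsigned compare
    const __m128i signflip = _mm_set1_epi32((int)0x80000000);
    const __m128i vthresh = _mm_xor_si128(_mm_set1_epi32((int)threshold), signflip);
    int i;

    for (i = 0; i < count; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&pixels[i]);
        // each mask lane is all 1's where pixel > threshold, all 0's elsewhere
        __m128i mask = _mm_cmpgt_epi32(_mm_xor_si128(v, signflip), vthresh);
        __m128i doubled = _mm_slli_epi32(v, 1); // multiply by 2
        // blend: take the doubled value where the mask is set, the original otherwise
        v = _mm_or_si128(_mm_and_si128(mask, doubled), _mm_andnot_si128(mask, v));
        _mm_storeu_si128((__m128i *)&pixels[i], v);
    }
}
```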
In the code above, I've written the conditional loop using Intel SSE intrinsics. The code would look virtually identical using other SIMD instruction sets (e.g. ARM NEON, AltiVec, DSPs). So the point of all of this: don't be afraid to write SIMD instructions in your programs to squeeze the maximum performance out of your hardware, and don't assume your compiler is smart about using SIMD.
SIMD Downsides
So far I've only mentioned the benefits of using SIMD; there are costs associated with its use as well:
- Unique code must be written for each target CPU architecture
- Some CPU architectures (e.g. x86) must test for specific instruction availability at run time (see the sketch after this list)
- Cross platform build scripts may increase in complexity to properly support SIMD
- New code means new opportunities for introducing new errors
- Harder to maintain if staff isn't familiar with SIMD
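For that x86 availability test, here's a minimal dispatch sketch using the __builtin_cpu_supports() builtin provided by gcc and clang (the function names come from the sketches above; MSVC would need its __cpuid intrinsic instead):

```c
#include <stdint.h>

// Prototypes for the scalar and SSE2 versions sketched above
void brighten_above(uint32_t *pixels, int count, uint32_t threshold);
void brighten_above_sse(uint32_t *pixels, int count, uint32_t threshold);

// Choose the best implementation once at startup, then call through the pointer
void (*brighten_above_ptr)(uint32_t *, int, uint32_t);

void init_dispatch(void)
{
    // __builtin_cpu_supports() queries CPUID for the named feature; SSE2 is
    // guaranteed on x86-64, so this check mainly matters for 32-bit builds
    if (__builtin_cpu_supports("sse2"))
        brighten_above_ptr = brighten_above_sse;
    else
        brighten_above_ptr = brighten_above;
}
```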
Like everything else in life, each project needs to weigh the costs versus the benefits of using SIMD. If speed or perhaps energy usage is high on the list of priorities, then using SIMD can be the reason a project succeeds.
Stumbled into this, nice article! During the recent crypto coin craze, I tried to optimize some ARM code implementing Salsa20 (https://en.wikipedia.org/wiki/Salsa20) using ARM NEON SIMD.
It looks very vectorizable. While doing that, I used C-based ARM intrinsics and kept different object files, some optimized by gcc with -O3 and -ftree-vectorize, to compare against my hand-optimized versions.
It turns out, after I spent a lot of effort, that the gcc -O3 -ftree-vectorize build actually ran quite a bit faster than my 'hand-optimized' SIMD. After looking at the generated assembly, it turns out that not all of the code was vectorized. Much of it was normal scalar code, and the CPU's superscalar pipeline did a lot of instruction-level parallelism on its own (e.g. running ahead and processing many instructions in a single cycle). My naive optimization, on the other hand, had severe bottlenecks: swapping values across vector registers isn't easy to do well, and in my case it broke into 16 swap instructions that stalled every cycle, one swap per register per lane.