The dead horse that I like to keep beating: everyone should use SIMD (single instruction, multiple data) in their software wherever possible. It's basically free performance, so why not use it? Every PC (x86 machine) sold in the last 10 years has it, and so have mobile/embedded machines for at least the last 5 years.

What is SIMD? It's a set of instructions that usually operate on extra-wide registers (typically 128 bits) and perform multiple operations in parallel. A regular CPU instruction performs a single math operation (e.g. an integer addition), while a SIMD instruction can do 2, 4, 8, or 16 separate additions in parallel in the same amount of time. The count depends on the element size: a 128-bit register holds sixteen 8-bit values or four 32-bit values. For code that fits this model, your program can do its work many times faster.

Many programmers are already aware that SIMD exists, but may assume that by enabling the "auto-vectorization" option of their C compiler, their program will magically contain beautifully crafted SIMD code. You can already guess from my last sentence that this is not always the case.
- Most software never makes use of these instructions because it takes extra effort to add and maintain them in your code.
- Your algorithm needs to allow for operations to occur in parallel. Image processing (pixel data) is usually a good candidate for SIMD optimization.
- Compilers are not very good at using SIMD automatically. This means you'll probably need to add intrinsics to your code to explicitly tell the compiler your intent.
- Each platform has its own unique SIMD instructions. There is great overlap between systems, but custom code must be written to take advantage of unique features in each platform.
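
To make the last two points concrete, here is a minimal sketch of what intrinsics look like and why each platform needs its own code path. It adds two byte arrays 16 elements at a time, using SSE2 on x86 or NEON on ARM, and falls back to plain C everywhere else. The function name `add_bytes` and the `HAVE_SSE2`/`HAVE_NEON` macros are made up for this illustration; the intrinsics themselves are the standard SSE2 and NEON ones. Treat it as a sketch under those assumptions, not as tuned production code.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>            /* SSE2 intrinsics (x86) */
#define HAVE_SSE2 1               /* hypothetical macro name */
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>             /* NEON intrinsics (ARM) */
#define HAVE_NEON 1               /* hypothetical macro name */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n bytes (wraps on overflow). */
static void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;

#if defined(HAVE_SSE2)
    /* x86: one _mm_add_epi8 performs 16 byte additions at once. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#elif defined(HAVE_NEON)
    /* ARM: same idea, different names for the same operations. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));
    }
#endif

    /* Scalar loop: handles the leftover tail, or everything if no SIMD path was compiled in. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

Calling `add_bytes(dst, a, b, 20)` would process the first 16 bytes with a single SIMD add and the remaining 4 in the scalar loop. Notice that the two SIMD paths do exactly the same work but share no code, which is the maintenance cost mentioned above.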
With exactly the right conditions and a little help, some C compilers can vectorize some loops. What happens when the loop isn't so simple? The C compiler usually gives up and doesn't use SIMD. What is a simple loop? Here's an example of some C code that a modern C compiler can successfully vectorize into SIMD code. The data is arranged in nice multiples of the SIMD register size and nothing special happens inside the loop: