The case of the missing SIMD code

Intro

Much of the open source software that we all use daily is not optimized, or at least not well optimized. I've talked about this before because it's an issue I feel very strongly about: code that runs billions of times a day shouldn't be inefficient. Over the last 10 years (and especially the last year), we spend more of our lives in front of a computer than ever, and computers have a non-trivial environmental impact. Bitcoin's environmental impact has popped up in the news recently, but the overall energy use of everyday software, in server farms as well as homes and offices, is significant and steadily rising.

There are plenty of people working diligently to make sure the code which runs the world is efficient, but what if some wrong assumptions were holding them back? SIMD instructions are the key to unlocking efficient software on both PCs and mobile devices, but they aren't used everywhere they need to be. You can find SIMD code in many popular libraries; it's not always optimal, and sometimes it's missing entirely. Often someone has made an effort to put optimized SIMD code into a library and has left it unfinished or suboptimal. This bothers me specifically because there is an implied level of trust in SIMD code authors: writing vector code requires extra knowledge of the CPU, which creates an inherent assumption that the author knows how to write efficient code. When I find inefficient code in widely used libraries, I feel the need to help. Another reason this bothers me is that SIMD code in open source libraries tends to never change: once the SIMD 'achievement' has been reached, it is virtually guaranteed that no one will ever look at it again.

The point of this blog post is to bring attention not only to important libraries which need additional optimization effort, but also to a programming topic that doesn't seem to be mentioned anywhere, namely that SIMD code can sometimes speed up "single lane" algorithms too. I've observed that single-lane SIMD is sometimes generated by C compilers, so why not write SIMD intrinsics manually when the situation warrants it?

libPNG

For this article, I'm going to focus on the missing SIMD code in libpng. This is one of those critically important libraries used daily by billions of people, and yet it has languished with suboptimal code for several years. For most images, the majority of PNG encoding and decoding time is spent in the deflate/inflate code, but the image filters can take a significant amount of time too, and that is the part that can be optimized with SIMD.

PNG filters improve the compressibility of the data by exploiting repeating patterns to reduce the total number of unique symbols to compress. In simpler terms, an ideal PNG filter turns most of the image data into 0's by replacing each pixel with the difference between it and its neighbors. If the majority of the data is various groupings of 0's, it compresses much more easily than the unfiltered data (a toy example of the idea follows below).

In 2016 there was an effort to optimize libpng with SIMD. The author(s), however, stopped short of writing SIMD versions of all of the filter functions that needed it because of a wrong assumption. I had a strong hunch that SIMD could help speed up the functions deemed 'not worthy' of hand-optimized code.
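To make the filtering idea concrete, here's a toy version of the simplest filter, "Sub" (my illustration, not libpng's code). A smooth ramp like {10, 12, 14, 16} filters down to {10, 2, 2, 2}, which the compressor handles far better:

    #include <stddef.h>

    /* Toy "Sub" filter: replace each byte with its difference from the
     * byte to its left. The differences wrap modulo 256 and are exactly
     * reversible on decode. */
    void sub_filter(unsigned char *out, const unsigned char *in, size_t n)
    {
        unsigned char left = 0;
        for (size_t i = 0; i < n; i++)
        {
            out[i] = (unsigned char)(in[i] - left);
            left = in[i];
        }
    }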

The Work

I chose a very large (10000 x 10000 pixel) grayscale PNG file to test decoding speed. The larger the image, the easier it is to see where the time is being spent in the profiler. For my testing, I wrote a simple test program which decodes the image 20 times. I enabled the SSE4.1 vector extensions in Xcode and enabled the SIMD-optimized filters in the libpng build configuration.
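The harness itself is nothing special. A minimal sketch using libpng's simplified png_image API might look like this (the filename is a placeholder for my test image):

    #include <stdlib.h>
    #include <string.h>
    #include <png.h>

    int main(void)
    {
        /* Decode the same large grayscale image 20 times so that the
         * filter and inflate costs dominate the profile. */
        for (int i = 0; i < 20; i++)
        {
            png_image image;
            memset(&image, 0, sizeof image);
            image.version = PNG_IMAGE_VERSION;

            if (!png_image_begin_read_from_file(&image, "test10k.png"))
                return 1;

            image.format = PNG_FORMAT_GRAY; /* 1 byte per pixel */
            png_bytep buffer = malloc(PNG_IMAGE_SIZE(image));

            if (buffer == NULL ||
                !png_image_finish_read(&image, NULL, buffer, 0, NULL))
                return 1;

            free(buffer);
        }
        return 0;
    }

With that built, it's time to run the test image through the profiler and see what happens: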

As I described earlier, the filter stage can take quite a bit of time. In this case, around 50% of the decode time is spent in inflate and 50% is spent in the de-filtering step. This seemed out of proportion considering that the SIMD code was enabled, so I stepped through the filter setup code and found this:


Oops - no one wrote SIMD code for 1-byte-per-pixel PNG files (grayscale or palette color pixels). There are also some filters left unoptimized for every pixel type. In the code above, 'bpp' refers to the number of bytes per pixel.
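The shape of that setup code, paraphrased from libpng's contrib/intel/intel_init.c, is roughly this; note that there is no case for bpp == 1 or bpp == 2, and no optimized "Up" filter at all:

    void
    png_init_filter_functions_sse2(png_structp pp, unsigned int bpp)
    {
       if (bpp == 3)
       {
          pp->read_filter[PNG_FILTER_VALUE_SUB-1] = png_read_filter_row_sub3_sse2;
          pp->read_filter[PNG_FILTER_VALUE_AVG-1] = png_read_filter_row_avg3_sse2;
          pp->read_filter[PNG_FILTER_VALUE_PAETH-1] = png_read_filter_row_paeth3_sse2;
       }
       else if (bpp == 4)
       {
          pp->read_filter[PNG_FILTER_VALUE_SUB-1] = png_read_filter_row_sub4_sse2;
          pp->read_filter[PNG_FILTER_VALUE_AVG-1] = png_read_filter_row_avg4_sse2;
          pp->read_filter[PNG_FILTER_VALUE_PAETH-1] = png_read_filter_row_paeth4_sse2;
       }
       /* 1- and 2-byte pixels fall through to the scalar C filters. */
    }

With the profiler, we can look a little deeper.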


We can see that the Paeth filter is actually taking MORE time than the inflate code, and the other filters' times fall off quickly from there. Let's write an optimized Paeth filter and see how that affects the performance.


My SIMD Paeth filter is quite a bit faster; it shaved about 20% off the total decode time. Let's take a step back and see why I was so confident that I could gain speed with a SIMD implementation on x86 even though the 1-bpp Paeth filter requires working with one pixel at a time.


This is the original Paeth filter code; in the right column is the x64 assembly language generated by the compiler at maximum optimization (-O3). Right away I saw three problems that could be improved on x86 by using SIMD code (the scalar source is reproduced after the list for reference):

  • There is no scalar instruction for calculating the absolute value, so the compiler uses a negate and a conditional move instead. Starting at line 23, you can see the 3-instruction sequence needed to accomplish the absolute value calculation.

  • Scalar instructions don't have a way of handling conditional statements without a branch. Starting at line 31 you can see a comparison followed by a "jl" (jump if less). Branches are expensive on a pipelined CPU, especially when mispredicted.

  • Finally, you can see how expensive it is to read/write one byte at a time (lines 37-40). Scalar code can access memory at up to the native integer size (64 bits), but SIMD still has the advantage here with 128-bit access. On the latest generations of Intel CPUs, memory access is much slower than instruction execution, even when the data is sitting in L1 cache. That gap isn't universal (the Apple M1's memory latency is quite low, for example), but typical mobile and desktop CPUs have it in common.
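For reference, here is the scalar filter, condensed from libpng's pngrutil.c (types come from png.h, abs() from <stdlib.h>). The Paeth predictor picks whichever of the left (a), above (b), and above-left (c) neighbors is closest to the gradient estimate a + b - c:

    static void
    png_read_filter_row_paeth_1byte_pixel(png_row_infop row_info, png_bytep row,
        png_const_bytep prev_row)
    {
       png_bytep rp_end = row + row_info->rowbytes;
       int a, c;

       /* First pixel: no left neighbor, so just add the byte above */
       c = *prev_row++;
       a = *row + c;
       *row++ = (png_byte)a;

       while (row < rp_end)
       {
          int b, pa, pb, pc, p;

          a &= 0xff; /* From previous iteration or start */
          b = *prev_row++;

          p = b - c;
          pc = a - c;

          pa = abs(p);      /* |b - c|      : cost of predicting with a */
          pb = abs(pc);     /* |a - c|      : cost of predicting with b */
          pc = abs(p + pc); /* |a + b - 2c| : cost of predicting with c */

          if (pb < pa) pa = pb, a = b;
          if (pc < pa) a = c;

          /* Add the chosen predictor to the raw byte; b becomes next c */
          c = b;
          a += *row;
          *row++ = (png_byte)a;
       }
    }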

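SIMD sidesteps the first two problems directly. Here's a minimal sketch of the idea (my helpers, not libpng code): a branch-free absolute value and a branch-free select for 16-bit lanes, where the select mask comes from a compare such as _mm_cmplt_epi16(pb, pa):

    #include <emmintrin.h> /* SSE2 */

    /* Branch-free 16-bit absolute value: abs(x) = max(x, -x).
     * (SSSE3 adds _mm_abs_epi16, which does this in one instruction.) */
    static inline __m128i abs16(__m128i x)
    {
        return _mm_max_epi16(x, _mm_sub_epi16(_mm_setzero_si128(), x));
    }

    /* Branch-free select: where a mask lane is all 1's take a, else b.
     * This replaces the scalar compare-and-jump pairs with straight-line code. */
    static inline __m128i select16(__m128i mask, __m128i a, __m128i b)
    {
        return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
    }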
A quick explanation of the SIMD instructions used below
For those of you already familiar with Intel SIMD (single instruction, multiple data), please skip to the next section. SIMD instructions work on wide registers (in this case 128 bits) and can treat those bits as groups of integers or floats of various sizes. For example, with 128 bits we can work on sixteen 8-bit integers (pixels) at a time, or eight 16-bit integers. SIMD instructions in most cases behave as a load/store machine: memory must first be loaded into registers, then operated on with arithmetic instructions, and then the results can be stored back to memory. Here is a brief description of the instructions I use in the example in the next section:


reg = _mm_loadu_si128((__m128i*)src_ptr) - Load 128 bits into a register from a potentially unaligned address
_mm_storel_epi64((__m128i*)dst_ptr, a) - Store the lower 64 bits of a __m128i register
reg = _mm_set_epi8(...) - Set a register to 16 8-bit values
reg = _mm_sub_epi16(__m128i a, __m128i b) - Subtract the 16-bit elements of b from a
reg = _mm_packus_epi16(__m128i a, __m128i b) - Pack (narrow) 16-bit integers into 8 bits with unsigned saturation
reg = _mm_unpacklo_epi8(__m128i a, __m128i b) - Interleave the lower 8 bytes of a and b
reg = _mm_srli_si128(__m128i a, int n) - Shift register a right by n bytes
reg = _mm_extract_epi8(__m128i a, int n) - Extract 8-bit lane n of a (SSE4.1)
reg = _mm_insert_epi8(__m128i a, int b, int n) - Insert b into 8-bit lane n of a (SSE4.1)
reg = _mm_shuffle_epi8(__m128i a, __m128i mask) - Rearrange the bytes of a according to mask (SSSE3)
reg = _mm_setzero_si128() - Set a register to all 0's
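To make the load/store pattern concrete, here's a toy example of mine (not from libpng): load 8 pixels, widen them to 16 bits, do some arithmetic, then narrow and store them again:

    #include <emmintrin.h> /* SSE2 */

    /* Add 1 to 8 grayscale pixels: load -> widen -> add -> narrow -> store.
     * _mm_loadl_epi64 reads just 8 bytes, so there's no over-read. */
    void add_one_to_8_pixels(unsigned char *p)
    {
        __m128i v = _mm_loadl_epi64((const __m128i *)p);        /* 8 bytes in low half */
        __m128i w = _mm_unpacklo_epi8(v, _mm_setzero_si128());  /* 8 x 16-bit lanes */
        w = _mm_add_epi16(w, _mm_set1_epi16(1));  /* arithmetic on 16-bit lanes */
        w = _mm_packus_epi16(w, w);               /* narrow back to bytes (saturating) */
        _mm_storel_epi64((__m128i *)p, w);        /* store the low 64 bits */
    }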

Here's a great guide to all of Intel's SIMD instructions: the Intel Intrinsics Guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

SIMD

Here's part of the SSE code I wrote for the Paeth filter (it uses a few SSSE3/SSE4.1 instructions in addition to SSE2). Newer instruction sets (e.g. AVX2) would simplify this slightly, and the compiler is actually smart about replacing some of my intrinsics with better alternatives. I'm confident that someone else can improve this further, but it's still impressive how much faster it is compared to the scalar code. The calculations need 16-bit intermediate values for the 8-bit pixel differences, so I decided to treat all of the pixel data as 16-bit by widening it with _mm_unpacklo_epi8. I also preload the next 8 pixels to help hide some of the memory latency.

        xmmB = _mm_loadu_si128((__m128i*)prev);          // B's & C's (previous line)
        xmmRow = xmmRowNext;
        xmmRowNext = _mm_loadu_si128((__m128i*)&row[8]); // current pixels
        xmmC = _mm_unpacklo_epi8(xmmB, _mm_setzero_si128());
        xmmB = _mm_unpacklo_epi8(_mm_srli_si128(xmmB, 1), _mm_setzero_si128());
        // a &= 0xff; /* From previous iteration or start */
        xmmA = _mm_unpacklo_epi8(xmmRow, _mm_setzero_si128());
        xmmRow = _mm_unpacklo_epi8(_mm_srli_si128(xmmRow, 1), _mm_setzero_si128());
        xmmP = _mm_sub_epi16(xmmB, xmmC);                // p = b - c


I then loop over the 8 pixels I've read, shifting everything down one pixel each iteration. Each processed pixel is extracted from one register and inserted into another. To reduce the number of registers required, I created a ROR (rotate right) operation on SIMD registers by using the _mm_shuffle_epi8 instruction with the following byte pattern:

    xmmRotate = _mm_set_epi8(1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2); // shuffle mask to rotate the register right (ROR) by 2 bytes, i.e. one 16-bit pixel
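Applied each iteration, it looks something like this (a sketch of the idea, not the exact loop body); since the pixels live in 16-bit lanes, a 2-byte rotate moves each pixel's values down one lane:

    // Rotate so the next pixel's a/b/c values land in lane 0 (sketch)
    xmmA = _mm_shuffle_epi8(xmmA, xmmRotate);
    xmmB = _mm_shuffle_epi8(xmmB, xmmRotate);
    xmmC = _mm_shuffle_epi8(xmmC, xmmRotate);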


At the end of the loop, I pack the 8 finished 16-bit pixels back into bytes and write them to the output.

        xmmRow = _mm_packus_epi16(xmmA, xmmA);
        // Need to insert the next finished pixel into the prefetched next set of 8
        xmmRowNext = _mm_insert_epi8(xmmRowNext, _mm_extract_epi8(xmmA, 14), 0);
        _mm_storel_epi64((__m128i*)&row[1], xmmRow);



Part of the challenge of this code is that each pixel depends on the value of the previously generated pixel, so transitioning from one set of 8 to the next requires transferring the last finished pixel into the newly read (unfiltered) pixels for the next pass through the loop.

Each new generation of Intel CPUs adds new instructions, but the additions have been focused almost entirely on SIMD rather than scalar code; operations like absolute value are still only available as SIMD instructions.

Conclusion

The reason I decided to write a blog post and not just submit a pull request for this code has to do with breaking assumptions. A classic example is the 4-minute mile: for many years people believed that no one could run a mile in under 4 minutes, and the barrier held until Roger Bannister broke it in 1954, after which other runners quickly followed. If C compilers can get a speed advantage by using SIMD instructions for "single lane" code, why can't programmers do the same with hand-written intrinsics? By bringing awareness to this topic, hopefully I'm freeing some minds to try this idea and even break some other long-held assumptions.
