The case of the missing SIMD code
Intro
Much of the open source software we all use daily is not optimized, or at least not well optimized. I've written about this before because it's an issue I feel strongly about: code that runs billions of times a day shouldn't be inefficient. Over the last 10 years (and especially the last year), we've spent more of our lives in front of a computer than ever, and computers have a non-trivial environmental impact. Bitcoin's environmental impact has popped up in the news recently, but the overall energy use of everyday software, in server farms as well as homes and offices, is significant and steadily rising.

There are plenty of people working diligently to make sure the code that runs the world is efficient, but what if some wrong assumptions were holding them back? SIMD instructions are the key to unlocking efficient software on both PCs and mobile devices, but they aren't used everywhere they need to be. You can find SIMD code in many popular libraries, yet it's not always optimal and sometimes it's missing entirely. Oftentimes someone has made an effort to put optimized SIMD code into a library and has left it unfinished or sub-optimal.

This bothers me specifically because there is an implied level of trust in SIMD code authors. Making use of the vector capabilities of the CPU requires additional knowledge, and that imbues an inherent trust that the author knows how to write efficient code. When I find inefficient code in widely used libraries, I feel the need to help. Another reason this bothers me is that SIMD code in open source libraries tends never to change: once the SIMD 'achievement' has been reached, it is virtually guaranteed that it will never be looked at again.
The point of this blog post is not only to bring attention to important libraries which need additional optimization effort, but also to a programming topic that doesn't seem to be discussed anywhere: SIMD code can sometimes speed up "single lane" algorithms too. I've observed C compilers generating single-lane SIMD on their own, so why not write SIMD intrinsics manually when the situation warrants it?
libPNG
For this article, I'm going to focus on the missing SIMD code in libpng. This is one of those critically important libraries used daily by billions of people, and yet it has languished with sub-optimal code for several years. For most images, the majority of PNG encoding and decoding time is spent in the deflate/inflate code, but the image filters can take a significant amount of time too, and that is the part that can be optimized with SIMD. PNG filters improve the compressibility of the data by exploiting repeating patterns to reduce the total number of unique symbols to compress. In simpler terms, a PNG filter ideally turns most of the image data into 0's by replacing each pixel value with the difference between it and its neighbors. If a majority of the data is various runs of 0's, it compresses much more easily than the unfiltered data. In 2016 there was an effort to optimize libpng with SIMD. The author(s), however, stopped short of writing SIMD versions of all of the filter functions that needed it because of a wrong assumption. I had a strong hunch that SIMD could help speed up the functions deemed 'not worthy' of hand-optimized code.
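To make the filtering idea concrete, here is a minimal sketch of PNG's simplest filter, Sub (filter type 1), for 1-byte-per-pixel data. The function names are mine for illustration, not libpng's: each output byte becomes the modulo-256 difference from its left neighbor, so a flat run of pixels collapses to zeros.

```c
#include <stddef.h>
#include <stdint.h>

/* Sub filter (PNG filter type 1), 1 byte per pixel: replace each
   pixel with its difference from the left neighbor. A run of
   identical pixels becomes 0,0,0,... which deflate compresses far
   better than the raw values. */
void sub_filter(uint8_t *out, const uint8_t *row, size_t len)
{
    uint8_t left = 0;
    for (size_t i = 0; i < len; i++) {
        out[i] = (uint8_t)(row[i] - left); /* modulo-256 difference */
        left = row[i];
    }
}

/* Decoding reverses it in place: add back the previously
   reconstructed pixel. This is the "de-filtering" step profiled
   later in this post. */
void sub_unfilter(uint8_t *row, size_t len)
{
    for (size_t i = 1; i < len; i++)
        row[i] = (uint8_t)(row[i] + row[i - 1]);
}
```

The Paeth filter discussed below follows the same difference-from-neighbors idea, but picks the best of three neighbors per pixel.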
The Work
I chose a very large (10000 x 10000 pixel) grayscale PNG file to test decoding speed; the larger the better, so that I could easily see where the time was being spent in the profiler. For my testing, I wrote a simple test program which decodes the image 20x. I enabled the SSE4.1 vector extensions in Xcode and enabled the SIMD-optimized filters in the libpng build configuration. Now to run my test image in the profiler and see what happens:
As I described earlier, the filter stage can take quite a bit of time. In this case, around 50% of the decode time is spent in inflate and 50% is spent in the de-filtering step. This seemed out of proportion considering that the SIMD code was enabled, so I stepped through the filter setup code and found this:
Oops: no one wrote SIMD code for 1-byte-per-pixel PNG files (grayscale or palette color pixels). There are also some missing (unoptimized) filters for every pixel type. In the code above, 'bpp' refers to the number of bytes per pixel. With the profiler, we can look a little deeper.
This is the original Paeth filter code. In the right column is the x64 assembly language generated by the compiler when set to maximum optimization (-O3). Right away I saw 3 problems that could be improved on x86 by using SIMD code:
- There is no scalar instruction for calculating an absolute value, so the compiler uses a negate and conditional move instead. Starting at line 23, you can see the 3-instruction sequence needed to accomplish the absolute value calculation.
- Scalar instructions have no way of handling conditional statements without a branch. Starting at line 31 there is a comparison followed by a "jl" (jump if less). Branching is expensive on a pipelined CPU, since a mispredicted branch stalls the pipeline.
- Finally, you can see how expensive it is to read/write one byte at a time (lines 37-40). Scalar code can access memory at up to the native integer size (64 bits), but SIMD still has the advantage here with 128-bit access. On the latest generations of Intel CPUs, memory access is much slower than instruction execution, even when the data comes from L1 cache. Memory access isn't always much slower (e.g. Apple M1 memory latency is quite low), but typical mobile and desktop CPUs have this in common.
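For reference, the scalar code the listing above was compiled from implements the Paeth predictor defined in the PNG specification. Here it is in straightforward C (my paraphrase of the spec, not libpng's exact source), with comments marking where the three problems above come from:

```c
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor (from the PNG specification): pick whichever of
   the left (a), above (b), or upper-left (c) neighbor is closest to
   the initial estimate p = a + b - c. */
uint8_t paeth_predict(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = (int)a + (int)b - (int)c;
    int pa = abs(p - (int)a);  /* no scalar abs: negate + cmov sequence */
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);
    if (pa <= pb && pa <= pc)  /* each comparison becomes a branch */
        return a;
    return (pb <= pc) ? b : c;
}
```

The per-pixel filter loop then calls this once per byte, which is where the one-byte-at-a-time memory traffic comes from.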
For those of you already familiar with Intel SIMD (single instruction, multiple data), please skip to the next section. SIMD instructions work on wide registers (in this case 128 bits) and can treat those bits as groups of different-sized integers or floats. For example, with 128 bits we can work on 16 8-bit integers (pixels) at a time, or 8 16-bit integers. SIMD instructions in most cases follow a load/store model: memory first needs to be loaded into registers, then operated on with arithmetic instructions, and then the results can be stored back to memory. Here is a brief description of the instructions I use in the example in the next section:
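As a concrete illustration of the load/operate/store pattern, here's a toy example of my own (not from libpng): adding two arrays of 16 bytes each with a single arithmetic instruction instead of 16 scalar iterations.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Load 16 bytes from each input, add all 16 pairs at once, store
   16 bytes of results. _mm_adds_epu8 saturates at 255 instead of
   wrapping around. */
void add16_saturating(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a); /* load 16 bytes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vs = _mm_adds_epu8(va, vb);               /* 16 adds in one op */
    _mm_storeu_si128((__m128i *)dst, vs);             /* store 16 bytes */
}
```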
Here's a great guide to all of Intel's SIMD instructions:
SIMD
Here's part of the SSE code I wrote for the Paeth filter (most of it is SSE2, though the byte insert/extract intrinsics require SSE4.1). Newer instructions (e.g. AVX2) would simplify this slightly, and the compiler is actually smart about replacing some of my intrinsics with better alternatives. I'm confident that someone else can improve this further, but it's still impressive how much faster it is compared to the scalar code. The calculations require carrying a 16-bit value from the 8-bit pixel differences, so I decided to treat all of the pixel data as 16-bit by expanding it with _mm_unpacklo_epi8. I also preload the next 8 pixels to help hide some of the memory latency.
xmmB = _mm_loadu_si128((__m128i*)prev); // B's & C's (previous line)
xmmRow = xmmRowNext; // pixels preloaded on the last iteration
xmmRowNext = _mm_loadu_si128((__m128i*)&row[8]); // preload the next 8 current pixels
xmmC = _mm_unpacklo_epi8(xmmB, _mm_setzero_si128()); // widen C's to 16-bit
xmmB = _mm_unpacklo_epi8(_mm_srli_si128(xmmB, 1), _mm_setzero_si128()); // shift 1 byte -> B's, widen
// a &= 0xff; /* From previous iteration or start */
xmmA = _mm_unpacklo_epi8(xmmRow, _mm_setzero_si128()); // widen current pixels to 16-bit
xmmRow = _mm_unpacklo_epi8(_mm_srli_si128(xmmRow, 1), _mm_setzero_si128());
xmmP = _mm_sub_epi16(xmmB, xmmC); // p = b - c
xmmRotate = _mm_set_epi8(1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2); // shuffle mask to rotate right (ROR)
xmmRow = _mm_packus_epi16(xmmA, xmmA); // pack finished pixels back down to 8-bit
// Need to insert the next finished pixel into the prefetched next set of 8
xmmRowNext = _mm_insert_epi8(xmmRowNext, _mm_extract_epi8(xmmA, 14), 0);
_mm_storel_epi64((__m128i*)&row[1], xmmRow); // store 8 de-filtered pixels
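The snippet above stops before the most interesting part: choosing between a, b, and c without branching. Here's a sketch of how that selection can be done branchlessly on eight 16-bit lanes using only SSE2. This is my own illustrative version, not necessarily the exact code that ended up in the filter; it relies on the identities pa = |p - a| = |b - c|, pb = |p - b| = |a - c|, and pc = |p - c| = |(b - c) + (a - c)|, so p itself never needs to be formed.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Branchless Paeth selection for eight 16-bit lanes: compare-and-mask
   replaces the scalar code's jumps, and max(x, -x) replaces abs(). */
static __m128i paeth_select_epi16(__m128i a, __m128i b, __m128i c)
{
    __m128i zero = _mm_setzero_si128();
    __m128i pa = _mm_sub_epi16(b, c); /* p - a = b - c */
    __m128i pb = _mm_sub_epi16(a, c); /* p - b = a - c */
    __m128i pc = _mm_add_epi16(pa, pb); /* p - c = (b-c) + (a-c) */

    /* 16-bit absolute value as max(x, -x): no negate/cmov sequence */
    pa = _mm_max_epi16(pa, _mm_sub_epi16(zero, pa));
    pb = _mm_max_epi16(pb, _mm_sub_epi16(zero, pb));
    pc = _mm_max_epi16(pc, _mm_sub_epi16(zero, pc));

    __m128i smallest = _mm_min_epi16(pc, _mm_min_epi16(pa, pb));

    /* Tie-break order a, then b, then c, as the PNG spec requires */
    __m128i use_a  = _mm_cmpeq_epi16(smallest, pa);
    __m128i use_b  = _mm_cmpeq_epi16(smallest, pb);
    __m128i b_or_c = _mm_or_si128(_mm_and_si128(use_b, b),
                                  _mm_andnot_si128(use_b, c));
    return _mm_or_si128(_mm_and_si128(use_a, a),
                        _mm_andnot_si128(use_a, b_or_c));
}
```

Every lane takes the same instruction path regardless of which neighbor wins, which is exactly what eliminates the "jl" branches seen in the scalar disassembly.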