With software, there's always a better way to do it

Warning

I will be naming and shaming a specific tech company, but most companies are guilty of the same thing.

Scenario

You define a standard used in the software industry and provide the reference code to make use of it. The standard is complex enough that developers don't want to re-invent the wheel by writing their own implementations. The reference code is incorporated into commercial products and used by millions (billions?) of people around the world. There is a slight problem: the reference code isn't properly optimized, so all of your users are wasting additional time and energy working with the data you've standardized.

The Specifics

The subject of this blog post is the OpenEXR image file standard created by Industrial Light & Magic. I'm not directly involved in working with these images, but my client is and I saw an opportunity to improve their productivity by optimizing access to them. There's nothing particularly wrong with the reference implementation, but when used to manage thousands of frames of high resolution animation, it becomes a burden on the artists. The standard isn't unreasonably complex, so I set out to write my own version to see how it compared to the reference version (my working assumption is that I can write faster code than open source libraries). In the case of the tool I created, it ran much faster than the previous solution in part due to my optimized OpenEXR code. I'm sure there are commercial versions used by private companies which perform better, but it's really up to the company which publishes the reference code to do a good job.

What I did

There's nothing particularly groundbreaking about my OpenEXR image decoder. I've written similar code before for my PNG decoder. The performance advantage came from making use of SIMD and careful memory management. Adding SIMD code to open source libraries complicates their development in multiple ways. One of the main issues is compatibility with the target machine. Intel's SIMD instructions have been slowly evolving over the last 20 years, and every couple of generations new ones are added. In this case, SSE 4.1 was the right choice to provide the necessary instructions to create an efficient implementation of OpenEXR. SSE 4.1 first shipped in Intel's Penryn processors in late 2007, so it seems a safe bet to add it to code which will run on professionals' machines. Even an older SIMD instruction set would provide a similar benefit for this particular problem. These instructions can be optionally enabled in the code so that it compiles on any machine without modification. I write my optimized versions such that the original C code is left in place and the optimized code is #ifdef'd in when appropriate. I can't release my version of this particular code because it was written as 'work for hire' and I don't own it. I imagine that legal obstacles like this prevent many open source projects from seeing improvement. That's why it's so important for the reference implementation to be optimal.
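To illustrate the #ifdef approach described above, here is a minimal sketch. It is not the code from the project (which can't be released), and the function names interleave_scalar and interleave_halves are made up for this example. It shows one kind of byte shuffling an image decoder might need after decompression: merging two halves of a buffer into a0 b0 a1 b1 ... order. This step only needs SSE2, so the sketch tests the __SSE2__ macro that GCC and Clang define; other compilers may need a different check. The plain C loop stays in place and handles both the non-SIMD build and the leftover bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)      /* defined by GCC/Clang when SSE2 is enabled */
#include <emmintrin.h>     /* SSE2 intrinsics */
#endif

/* Original C path: always compiled, also used for the tail of the SIMD path.
 * Writes a[0] b[0] a[1] b[1] ... for 'count' byte pairs. */
static void interleave_scalar(const uint8_t *a, const uint8_t *b,
                              uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[2 * i]     = a[i];
        dst[2 * i + 1] = b[i];
    }
}

/* Hypothetical helper: treats src as two halves of n/2 bytes each and
 * interleaves them into dst (n bytes). src and dst must not overlap. */
void interleave_halves(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(n % 2 == 0);    /* this sketch only handles even lengths */
    size_t half = n / 2;
    const uint8_t *a = src;
    const uint8_t *b = src + half;
    size_t i = 0;

#if defined(__SSE2__)
    /* Optimized path: 16 bytes from each half become 32 output bytes. */
    for (; i + 16 <= half; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + 2 * i),
                         _mm_unpacklo_epi8(va, vb));  /* pairs 0..7  */
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16),
                         _mm_unpackhi_epi8(va, vb));  /* pairs 8..15 */
    }
#endif

    /* Remaining bytes, or everything when SIMD is not available. */
    interleave_scalar(a + i, b + i, dst + 2 * i, half - i);
}
```

Because the scalar loop is untouched, the same file builds on a machine without SSE2, and the two paths can be checked against each other on a machine that has it.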

Why Mention it?

This week I'm working on additional tools that need to decode these files and noticed that my implementation is about 3x faster than the one which ships with the latest version of macOS. Previewing OpenEXR images requires doing a full decode (there is no embedded thumbnail). On my 2016 MacBook Pro 15, it takes so long that you see the spinning busy animation while it loads. I'm not sure if Apple bothers to write custom code for their image handling, but it sure looks like they just took the reference code as-is. In this video you can see the difference between the Apple implementation and my code (run multiple times, so disk caching is not a factor in the speed). The specific image being previewed is 5760x2880 pixels with 3 channels, and uses half-float pixels compressed with the "ZIP-16" method.


Stating the Obvious

If you use open source libraries in your products or daily work, it's fair to assume that they're not optimized. There are a few outliers which have gotten more attention and are optimized. If optimizing them meant a 10-20% speed increase, it wouldn't be a pressing issue, but a 3x speedup matters. The bigger picture: the software that runs the world falls in the same bucket.
