Software Optimization Specialist

Recently I've been talking with potential clients and have needed to explain in detail what it is that I do. My title and resume don't do a great job of conveying that information, so I thought it would be useful to collect my ideas here. Lately, the majority of my time is spent making other people's code run faster, but I also work with bits, bytes, pixels and embedded devices. The following are anecdotes from working with various clients. For security reasons, specific project details may be omitted.

Back End / Server

A while ago, I cold-called a company with a proposition. They run a "software as a service" that processes tons of images for clients using their own server machines. I asked if they were using open source software to run their business and if they were satisfied with the performance. It turns out that I caught them at a good time because they were having performance problems and were about to purchase more servers to handle their growing list of clients. I asked them about the tools they use and, not surprisingly, they were depending on ImageMagick and some other open source tools to do all of their image processing. I offered them a risk-free way to see if I could help them and they agreed to try. I ended up writing new Linux command line tools for them to replace most of the ones they were using. For their main use case, my code runs 10x faster than their original code, which allowed them to provide better response times and avoid buying more machines. That code has been running their business since then.

Mobile + Desktop

A friend connected me to the Astropad team and suggested I might be able to help them with their next software release. They were looking to make some big changes, including improving the performance. Their product mirrors the Mac's display on an iPad Pro so that the Apple Pencil can be used as a drawing device for a macOS paint program (among other uses). Their program essentially copies and compresses the display memory on the Mac, then decompresses and displays it on the iPad while capturing the input on the iPad and simulating it on the Mac. I profiled their code and rewrote the time-critical sections using SIMD instructions (x86 + ARM64). We then brainstormed some ideas and I helped redesign their data compression scheme. The newest release of their product includes all of these changes and is dramatically faster.

Embedded Project 1

A client created a small, battery-operated device which used a DSP (Movidius Myriad2) to process live video. They were having performance problems (overall speed and battery usage). I analyzed their design and suggested they change the compression method used for their imaging pipeline. I provided the custom codec code on both the DSP and the mobile device (ARM). I also rewrote sections of their image processing pipeline. The end result was that the device processed the images 23 times faster and the output was higher quality than the original design. Part of the work involved writing Myriad2 assembly language because the C++ compiler didn't generate efficient code. This was true for both vectorized code and generic C code. Of course, DSP assembly language takes extra effort and is more difficult to write and maintain, but in this case, speed and battery life were critical to the success of the product.

Embedded Project 2

A client created a small "black box" device for his customers based on an Arduino 101 (Intel Curie) along with a rechargeable battery and charging circuit. The design worked, but the cost per unit was high and it was missing some features which would have been beneficial. I redesigned the project to use less expensive parts and include an LCD display. My design could run for months from a single AA battery and the total BOM cost was reduced to under $8. I designed the circuit and wrote all of the software (LCD support and sensor input) to run on the AVR microcontroller (ATmega328p).

Scanned Imaging Project

A client provides IT services for a large church organization. For years, the church used a document management solution to scan and save electronic copies of their files. After many years, support for the software ended and they were left with tons of documents stored in a proprietary file format. They could not find any software to read the files, so this client was given the task of converting them into a supported image format. The client asked me to help with the conversion (he found me on Stack Overflow). I helped reverse engineer the files and created a command line tool (Linux + macOS) to extract and validate the images (weed out corrupt data) and rewrite them into a standard file format (TIFF). The tool I wrote does not contain any 3rd party code, but instead uses code from my own imaging library. My optimized CCITT G4 codec allows each file to be processed in a few milliseconds, which makes it practical to handle the hundreds of thousands of images in a reasonable amount of time.

Desktop Artist Tools

A client has a team of artists creating animated stories. Part of their work involves converting the images, audio and 3D models into a final story file for playback. On this particular story, the artists were waiting 20 minutes for their stories to 'compile' each time they made a change. They were all using high-end Mac workstations and the wait time was hurting productivity. I profiled their tools and found that they were using a mixture of open source and proprietary tools to accomplish the task. One by one, I began rewriting and optimizing each tool, using SIMD instructions where possible. Some examples of what I improved:

GPU texture compression and decompression
3D vertex compression and decompression
Image file conversion
Binary file reading and writing (was using a Stream class and reading/writing in small blocks)

When I had finished, the task that formerly took 20 minutes to complete was now down to 4 minutes. This allowed them to finish their story before the deadline.
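To illustrate the last item in the list above (binary file I/O in small blocks), here is a minimal sketch in C of the general idea: read the whole file with one large request and then parse fields from memory, instead of issuing many small reads through a stream abstraction. The function names read_whole_file and read_u32_le are hypothetical and chosen for this example; this is not the client's code, and the little-endian field layout is an assumption.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: load an entire file into one heap buffer using a
   single fread call. Returns NULL on failure; the caller frees the buffer. */
unsigned char *read_whole_file(const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* Find the file length by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    /* Allocate at least one byte so an empty file still yields a valid pointer. */
    unsigned char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (!buf) { fclose(f); return NULL; }

    /* One large read instead of thousands of tiny ones. */
    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }

    *out_size = got;
    return buf;
}

/* Hypothetical helper: decode a little-endian 32-bit value directly from
   the in-memory buffer, with no further I/O calls. */
uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}
```

Once the data is in memory, every field access is a few loads and shifts rather than a library call, which is usually where the time goes when a tool reads a large file a few bytes at a time. Writing can be handled the same way: assemble the output in a buffer and write it out in large blocks.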

Mobile Story Player

A client publishes a mobile app that runs on iOS and Android. The app needs to display animated features consisting of both flipbook type and GPU rendered animations. In order to run smoothly and not use too much battery power, the player code must be as efficient as possible. I've spent much time optimizing various parts of the video and audio engine, but one particular part stands out. A previous developer had already optimized some code using ARM NEON intrinsics. My assumption is: "there's always a way to make it faster". I analyzed the code and saw that it wasn't making the best use of memory accesses, so I rewrote it. My change doubled the speed of that function. A good reminder that just writing something using SIMD instructions doesn't mean that it's fully optimized.

Camera de-bayer optimization

A client who sells 3D capture hardware needed a faster way to convert color images from the camera (Bayer pattern) into RGB and YUV format. The chosen algorithm employed multiple decision paths for determining each pixel color to create the highest quality output. The compiler couldn't turn this algorithm into efficient code due to its multiple branches and byte-by-byte processing of the pixels. I streamlined the C code and added SIMD instructions which were able to process multiple pixels in parallel. The final result executes 6x faster than the original.
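As a rough illustration of why per-pixel branches get in the way, here is a small sketch in plain C. It assumes a simple edge-directed rule for interpolating a missing green value (use the pair of neighbors with the smaller gradient); that rule, and the function names green_branchy and green_branchless, are assumptions for this example and not the client's algorithm. The branch-free version computes both candidates and selects one with a mask, which is the same pattern SIMD compare and select instructions apply to many pixels at once.

```c
#include <stdint.h>
#include <stdlib.h>

/* Branchy reference: choose the interpolation direction per pixel. */
static uint8_t green_branchy(uint8_t left, uint8_t right,
                             uint8_t up, uint8_t down)
{
    int dh = abs((int)left - (int)right);   /* horizontal gradient */
    int dv = abs((int)up - (int)down);      /* vertical gradient */
    if (dh < dv)
        return (uint8_t)((left + right + 1) >> 1);
    return (uint8_t)((up + down + 1) >> 1);
}

/* Branch-free version: compute both averages, then blend them with a
   mask that is all ones when the horizontal pair wins and zero otherwise. */
static uint8_t green_branchless(uint8_t left, uint8_t right,
                                uint8_t up, uint8_t down)
{
    int dh = abs((int)left - (int)right);
    int dv = abs((int)up - (int)down);
    int h = (left + right + 1) >> 1;        /* rounded horizontal average */
    int v = (up + down + 1) >> 1;           /* rounded vertical average */
    int mask = -(dh < dv);                  /* -1 (all bits set) or 0 */
    return (uint8_t)((h & mask) | (v & ~mask));
}
```

Both functions return the same value for every input. The second one does slightly more arithmetic per pixel, but with no data-dependent branch the same sequence of operations can be applied to a whole vector of pixels in parallel.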

Comments

Unknown, August 30, 2018 at 2:00 PM
I recall some of the H.264 software codec optimization Larry did for Lync at MS. Always interesting and fun talking with Larry about how best to make vcodecs run faster. Seems like he always got to a great result, often in ways no one had thought about.
