Optimizing JPEG decoding

The aim of this blog post is to document my new optimized Arduino/embedded JPEG decoder library. There are already a couple of good Arduino libraries available, so why re-invent the wheel? It's best to take a step back and explain why it's challenging to decode JPEG files on 'small memory' devices; then it will be clearer why I wrote it.

Let's say that we want to display a 320x240 image on one of the popular TFT LCD displays. A 320x240x16-bit (RGB565) image requires 153K bytes of storage. For a device like the Arduino Mega2560, it would be impossible to fit the whole image in its 8K of internal RAM. Luckily, because of the way baseline JPEG images are stored, it's not necessary to hold the entire image in memory. The image is divided into MCUs (minimum coded units), which are built from basic 8x8 Discrete Cosine Transform blocks. Each of these blocks can be decoded in sequence and sent to a SPI LCD display individually, since SPI LCDs have their own internal frame buffer memory. The microcontroller's RAM then only needs to hold several dynamic data tables and the temporary variables that go along with the current MCU.

The existing JPEG libraries for Arduino use memory very sparingly and are able to operate within the 8K RAM of the Mega2560. One of the downsides of supporting 8-bit MCUs with so little memory is that the code is 'held back' when running on more capable MCUs. In other words, a 32-bit MCU with more memory can take advantage of additional optimizations to perform much better. This was the motivation for me to modify my old JPEG code to run on more recent embedded microcontrollers. I wrote my own JPEG decoder around 1994 and over the years I've come up with some clever optimizations to improve the performance. For my Arduino library, I decided that the minimum target system is a 32-bit CPU with at least 20K of RAM, such as an ARM Cortex-M0. With the extra memory and 32-bit wide registers, my code can perform better than the existing libraries. I also added some functionality that wasn't available before, such as support for viewing 8-bit grayscale images and Exif thumbnail decoding.

In my previous post about JPEG entropy decoding, I documented a shortcut for fast Huffman decoding. This is just one of many optimizations I use to speed up the decode process. This post aims to explain the main ways I've found to speed up the entire decode process. It would take a much longer article to explain how JPEG images are encoded, so I'll summarize the process first, then explain how I optimize each step.

Here's a very basic summary of the decode process...

  • Read compressed data from a file or memory
  • Decode Huffman symbols into 8x8 DCT coefficients
  • Inverse DCT transform
  • Scale and clip YCbCr output of DCT transform
  • Convert YCbCr pixels into RGB565
  • Draw pixels on the LCD

...and how I optimized each one

File / Compressed Data Handling

This doesn't sound like a very complex issue, but it's one of the reasons my JPEG decoder performs well. The various SD card libraries for Arduino/embedded are all in a similar bind when it comes to using RAM. Fast access to a FAT file system is helped by being able to cache the directory entries. This can't really be done on a small-memory device, so the only cache available is usually reserved for the current data sector or sectors. It's best to read data sequentially, because seeking will require extra reads of the FAT directory and, depending on the card, this can take a long time. It's also a good idea to read the data in large enough chunks that the call overhead doesn't overshadow the time spent processing the data. The worst case would be reading 1 byte at a time.

A second problem exists within the compressed JPEG data - markers. JPEG markers are byte pairs which begin with 0xFF (FFxx) and are allowed to appear in the compressed data stream. The only marker that makes sense to be there is the Restart Marker. It was originally added to allow resynchronization after an error in corrupted data. Since modern communication channels generally provide error-free transmission, restart markers don't offer any useful information to our decoder. They come at regular intervals and, if removed, won't cause us any trouble. The possibility of markers means that any 0xFF byte in the stream must be checked to see if it starts one; this is also why the encoder adds a 0x00 after every 0xFF data byte, so that real data can't be confused with a marker. Adding these extra 0x00's is called byte stuffing.

So... most JPEG decoders have logic in their "get next byte from stream" functions to check for 0xFF and then recognize a marker or stuffed byte. This checking actually wastes a significant amount of time. In my "get more data" function, I quickly go through the incoming compressed data and remove all markers and stuffed bytes. This allows the inner loop of my decoder to just read incoming data without constantly comparing it against 0xFF. Here's a peek at the inner decode loop 'unshackled' from having to test any conditions other than decoding the code (MOTOLONG() reads a register-width chunk of the stream in big-endian order):

if (ulBitOff > (REGISTER_WIDTH - 17)) // need to get more data
{
    pBuf += (ulBitOff >> 3);
    ulBitOff &= 7;
    ulBits = MOTOLONG(pBuf);
}
ulCode = (ulBits >> (REGISTER_WIDTH - 16 - ulBitOff)) & 0xffff; // get as lower 16 bits
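The de-stuffing itself happens ahead of time, in the function that refills the compressed-data buffer. Here's a minimal sketch of the idea - not the library's actual code, and DeStuffData is a made-up name. A real streaming version also has to handle a 0xFF that falls on a chunk boundary:

#include <stdint.h>

// Copy compressed data while stripping stuffed bytes and restart markers,
// so the inner decode loop never has to test for 0xFF.
int DeStuffData(uint8_t *pOut, const uint8_t *pIn, int iLen)
{
    int iOut = 0;
    for (int i = 0; i < iLen; i++) {
        uint8_t c = pIn[i];
        if (c == 0xff && i + 1 < iLen) {
            uint8_t c2 = pIn[i + 1];
            if (c2 == 0x00) { // stuffed byte: keep the 0xff, drop the 0x00
                pOut[iOut++] = 0xff;
                i++;
                continue;
            }
            if (c2 >= 0xd0 && c2 <= 0xd7) { // restart marker (RST0-RST7): drop both bytes
                i++;
                continue;
            }
        }
        pOut[iOut++] = c;
    }
    return iOut; // number of 'clean' bytes written to pOut
}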

Huffman Decode

In addition to faster table lookups for decoding the variable length codes, I also added a few shortcuts to save time when calculating the DCT transform. Here's a peek at the simplified Huffman decode step. A single read from a 10-bit table converts the VLC bit pattern into its symbol and length after testing for 'long' codes:

if (ulCode >= 0xfc00) // first 6 bits = 1, use long table
    ulCode &= 0x7ff; // 10-bit code (long/upper table)
else
    ulCode >>= 6; // (short/lower table)
usHuff = pFast[ulCode]; // get the symbol and length

After converting the VLC to the JPEG symbol and length, there are additional steps to fix the value based on its magnitude and sign, then store it in the DCT array. When decoding with the 1/4 or 1/8 size options, the MCU decode logic skips these steps for coefficients which won't be used. For example, with the 1/4 option, only the first 4 coefficients are needed to create the 2x2 pixel output, so the other 60 coefficients are still decoded from the bitstream (they must be, since they're variable length), but not fully calculated or stored. This saves a bit of time in the decoding loop.
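The magnitude/sign fix mentioned above is the standard 'extend' step from the JPEG spec (ITU-T T.81, figure F.12). Here's a compact sketch of it, assuming the extra magnitude bits have already been pulled from the stream:

// Convert the raw 'extra' bits of a coefficient into a signed value.
// iBits = the extra bits read from the stream, iLen = how many there are.
static int JPEGExtend(int iBits, int iLen)
{
    if (iLen == 0)
        return 0;
    if (iBits < (1 << (iLen - 1))) // MSB clear -> the coefficient is negative
        iBits += 1 - (1 << iLen); // e.g. with 3 bits, raw 010 becomes -5
    return iBits;
}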

IDCT Transform

The DCT transform is the heart of JPEG imaging. This matrix operation transforms the data between the spatial and frequency domains. The importance of the DCT in JPEG is that in the frequency domain, if you remove the high frequency components, then return to the spatial domain, you've distorted fine lines, but less 'sharp' features are mostly unaffected. This saves space since you don't have to store those coefficients. It's a reasonable tradeoff for photo-quality images, but doesn't do well on documents with sharp boundaries between colors. My library saves time in the DCT calculations in 3 ways:

  • Avoids calculating empty rows/columns. These are marked during the decode loop so that calculations on 0's can be skipped.

  • Calculates a limited DCT for 1/4th size (2x2) and skips the DCT for 1/8th size options

  • If there are no non-zero A/C coefficients (aka solid color block), the DCT code is entirely avoided and a shortcut is used to calculate the 8x8 transformed values since they still need to be de-quantized
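For that last case, the shortcut can be as simple as de-quantizing the DC term, applying the IDCT's 1/8 scale factor, and replicating it. A minimal sketch of the idea (FlatBlock is a made-up name, and it assumes a 64-entry de-quantization table; the library's actual shortcut also handles rounding and the level shift):

#include <stdint.h>

// A DC-only 8x8 block comes out of the IDCT as a single flat value,
// so skip the transform and fill the output directly.
static void FlatBlock(int16_t *pDest, int iDC, const uint16_t *pQuant)
{
    int iPixel = (iDC * pQuant[0]) >> 3; // de-quantize + 1/8 IDCT scaling
    for (int i = 0; i < 64; i++)
        pDest[i] = (int16_t)iPixel;
}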

Adjust and clip YCbCr output

I tried to balance speed with memory usage, and one of the compromises was to place a few lookup tables in the code (flash) to help with this. The pixel values generated from the IDCT calculations are in the YCbCr colorspace and can have values beyond the range -128 to +127. These then need to be adjusted and clipped. I created a lookup table to reduce the number of operations. The original operation was:

output = dct_val + 128;
if (output < 0) output = 0;
else if (output > 255) output = 255;

The lookup table version is:

output = ucRangeTable[(dct_val + 128) & 0x3ff];

The sum is masked with 0x3ff so that negative values land in the index range 0x200-0x3ff. On a PC or mobile CPU, the lookup might not be any faster than the compares, because memory access is slow relative to instruction execution, but on embedded CPUs, memory usually isn't the bottleneck.
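For reference, here's how such a 1024-entry table could be generated (a sketch; in the library the table lives in flash as const data, and it assumes the IDCT output stays within roughly +/-512 of the valid range):

#include <stdint.h>

uint8_t ucRangeTable[1024];

void InitRangeTable(void)
{
    for (int i = 0; i < 1024; i++) {
        if (i < 256)
            ucRangeTable[i] = (uint8_t)i; // in range: pass through
        else if (i < 512)
            ucRangeTable[i] = 255; // positive overflow: clip to 255
        else
            ucRangeTable[i] = 0; // negatives wrap to 0x200-0x3ff: clip to 0
    }
}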

LCD Output

I've already written a blog post about getting the most speed out of SPI LCD displays. It comes down to writing as many pixels as possible in each SPI transaction. For the JPEG library, the same rules apply. The logical "unit" or group of pixels is the MCU (minimum coded unit), which can be 8x8, 8x16, 16x8 or 16x16 pixels depending on the subsampling option. Using it as the unit for writing to the display would slow down the output compared to using larger blocks. To overcome this, I reserved a 4K-byte (2K-pixel) buffer in the JPEGIMAGE structure and collect as many MCUs as will fit in it before calling the JPEGDRAW callback to send them to the display. If you would instead like to control how many MCUs are passed per call, use the setMaxOutputSize(int iMCUs) method.
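Here's roughly what a draw callback looks like in use. The lcdSetWindow()/lcdWritePixels() calls are placeholders for whatever your display library provides; check the library header for the exact JPEGDRAW fields:

// Called by the decoder with a batch of MCUs rendered as RGB565 pixels.
// Sending the whole batch in one SPI transaction keeps the bus busy.
int JPEGDraw(JPEGDRAW *pDraw)
{
    lcdSetWindow(pDraw->x, pDraw->y, pDraw->iWidth, pDraw->iHeight);
    lcdWritePixels(pDraw->pPixels, pDraw->iWidth * pDraw->iHeight);
    return 1; // returning 1 tells the decoder to continue
}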

Final Thoughts

Each of the optimizations above shaves a few cycles off the total execution time and allows the library to perform well as a whole. Attention to minute details is always needed in my line of work. My 'big memory' JPEG library also includes SIMD optimizations for x86 and ARM. I didn't add any to this code since those platforms aren't its target audience. There is limited SIMD support available on some embedded platforms and I'll re-evaluate whether or not to add it in future versions of the library.

If you've made it this far, hopefully I was able to add some useful optimization ideas to your toolbox for future projects. Feel free to ask any questions by using the comments system of this blog so that others can gain from the conversation.

