### Optimizing JPEG entropy decoding

#### What
Entropy coding is the term used for lossless data compression. It takes advantage of the differing probabilities of symbols to create a more compact representation of them: the length of the code used to represent each symbol is inversely related to its probability. In other words, the more probable symbols are replaced with shorter codes and the less probable with longer codes, thereby reducing the total size of the data. A simple example of entropy coding is Morse code. The letters of the alphabet are replaced by a series of dots and dashes, where the probability of the letter (in English) determines the length of the code. The way Morse code is defined, there must be a space between codes in order to recognize each one uniquely; otherwise a code could be confused with a different one. For example, the letter E is a single dot, so two E's sent without a space between them would be interpreted as an I (two dots).

David Huffman invented a different entropy coding system in which the codes can be packed together without spaces. This 'minimum redundancy' code is what's used in many computing applications to compress data. This property of left-to-right uniqueness (no code is a prefix of another) also means that Huffman codes tend to grow in length relatively quickly as the number of symbols grows. Based on the probability of the symbols being encoded, a Huffman "tree" is created which produces a series of codes made up of a variable number of 1s and 0s, each representing a left or right branch of the tree.

#### Why
Most JPEG image files use Huffman compression (arithmetic coding is a less popular option) as the final step to shrink the size of the symbols used to encode the image. The goal of this article is to document techniques for decoding JPEG Huffman data more efficiently than the "standard" way employed in most software. JPEG images are so ubiquitous across the world that a significant amount of energy and time is spent working with them. Any improvement in decoding efficiency can translate into both improved usability and lower energy usage.

#### How
Part of the performance problem with decoding baseline JPEG data is that markers can appear in the middle of the variable-length codes. Markers are byte-aligned 8-bit values of 0xFF followed by an 8-bit marker value. The only markers that should be found inside the entropy-coded data are 'stuffed zeros' and restart markers. The stuffed zero (FF 00) is just a way to store a data byte of FF, since a bare FF would otherwise be interpreted as the start of a marker. Typical JPEG decoders therefore have to read the compressed data one byte at a time, always checking for the value FF and acting accordingly. I've found that it's faster to pre-filter the markers out of the compressed data before decoding. Filtering out the stuffed zeros is obviously safe, but what about the restart markers? If the image is not corrupted, the restart markers occur at known intervals; if it is corrupted, other action should be taken anyway. So, in the interest of speed, it's fine to strip the markers out ahead of time. It's faster because a dedicated marker-removal loop will beat placing extra tests inside the decoder's inner loop, and the dedicated "marker removal filter" can use SIMD to speed up the search for FF bytes in the data stream.
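As a sketch of what that pre-filter step might look like (the function name and buffer handling here are my own illustration, not taken from any particular decoder):

```c
#include <stddef.h>
#include <stdint.h>

// Copy JPEG entropy-coded data from src to dst, removing stuffed zeros
// (FF 00 becomes a plain FF data byte) and restart markers (FF D0-D7).
// Any other marker ends the scan data. Returns the number of bytes written.
// dst must be at least len bytes long.
size_t strip_jpeg_markers(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t i = 0, o = 0;
    while (i < len) {
        uint8_t c = src[i];
        if (c != 0xFF) {                 // ordinary data byte
            dst[o++] = c;
            i++;
        } else if (i + 1 < len && src[i + 1] == 0x00) {
            dst[o++] = 0xFF;             // stuffed zero: keep the FF, drop the 00
            i += 2;
        } else if (i + 1 < len && src[i + 1] >= 0xD0 && src[i + 1] <= 0xD7) {
            i += 2;                      // restart marker: drop both bytes
        } else {
            break;                       // any other marker ends the scan
        }
    }
    return o;
}
```

The inner `FF` search in a production version is where SIMD pays off; the scalar loop above just shows the filtering rules.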

Now let's look at the actual variable-length decoding. There are many ways to decode Huffman-compressed data. The least efficient is to read a single bit at a time and walk through the binary tree until a terminal node is hit. The left-to-right unique (prefix) property of Huffman codes can be used to accelerate this process. A practical example is the AC coefficient encoding in JPEG image files. The Huffman codes defining the AC coefficients vary from 2 to 16 bits in length. If the maximum-length code (16 bits) is read from the compressed data stream, it can be used as an index into a lookup table covering all 65536 permutations of 16 bits. Each table entry contains the symbol's value along with its true length. In the example below, the bits that are not part of the current code are marked with an x.
Huffman codes: 00 (length 2), 100 (length 3), 11010 (length 5), … up to 16 bits in length

```
|--- 16 bits ----|
 00xxxxxxxxxxxxxx
 100xxxxxxxxxxxxx
 11010xxxxxxxxxxx
```
In each case, the table contains repeated entries for every possible combination of the unused bits. For the shortest code (00), there will be 16384 entries that all contain the same uncompressed symbol along with its true length (2). This method of decoding the compressed symbols is fast (a single read), but the large table size makes it impractical for some applications. In the case of JPEG AC coefficients, there can be up to 4 different AC tables per image, and each entry requires 2 bytes, for a total of 65536 x 2 x 4 = 512K bytes. Various techniques exist to reduce the table size, but they usually involve creating a number of smaller tables which require multiple reads in order to arrive at the correct code.
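To make the repeated-entry layout concrete, here is one way the full 16-bit table could be built from a list of (code, length, symbol) triples. The function name and the entry packing (true length in the upper byte, symbol in the lower byte) are my own illustration:

```c
#include <stdint.h>

// Fill a 65536-entry lookup table. Each Huffman code is left-justified in
// 16 bits; every combination of the unused low bits receives the same entry.
// Entry format (illustrative): (true length << 8) | symbol.
void build_full_table(uint16_t table[65536], const uint16_t *codes,
                      const uint8_t *lengths, const uint8_t *symbols, int count)
{
    for (int i = 0; i < count; i++) {
        uint32_t base = (uint32_t)codes[i] << (16 - lengths[i]); // left-justify
        uint32_t span = 1u << (16 - lengths[i]); // duplicates for unused bits
        uint16_t entry = (uint16_t)((lengths[i] << 8) | symbols[i]);
        for (uint32_t j = 0; j < span; j++)
            table[base + j] = entry;
    }
}
```

With the three codes from the diagram above, the 2-bit code 00 fills entries 0 through 16383, exactly the 16384 duplicates described in the text.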

#### A Better Way
A property of Huffman codes can be exploited to reduce the table size.  In order to maintain the left-to-right uniqueness, long Huffman codes have a number of leading ones (or zeroes, depending on how the codes are generated).  Here are some examples of long codes for JPEG AC coefficients:

```
111111110011
1111111111111110
```
Each Huffman tree has a practical "split point" where there are a certain number of leading 1s followed by the unique part of the code.  In the case of JPEG AC coefficients, the useful split point is at 5 or 6 leading 1s, followed by up to 10 or 11 other bits.  Codes with 5/6 leading 1s will be called "long" codes and codes without 5/6 leading 1s will be called "short" codes.  In this example, the short codes can be up to 10/11 bits long and the long codes can also be up to 10/11 bits long (not counting the 5/6 leading 1s).  The short and long tables are not always the same length, but in this case it works out that way.  To decode all short and long codes, 2 tables with 1024 entries each (10 unique bits) are needed.  Compared to the 512K of the single-table approach, holding all 4 possible AC tables now takes 1024 entries x 2 bytes x 2 tables x 4 = 16K bytes.  If the tables are arranged with the long-code table immediately following the short-code table, decoding can be made even more efficient.  Here is sample C code which demonstrates how to decode JPEG AC coefficients with a single table read.

```c
uint32_t ulCompressedBits, ulCode;
uint8_t  iCurrentBitNumber;
uint16_t *usACTable, usValue;

// extract 16 bits (left justified) of compressed data
ulCode = (ulCompressedBits << iCurrentBitNumber) >> 16;
if (ulCode >= 0xfc00) // top 6 bits == 1 -> long code
{
    // bit 10 (being 1) points us into the long table
    usValue = usACTable[ulCode & 0x7ff];
}
else
{
    // use the upper 10 bits
    usValue = usACTable[ulCode >> 6];
}
iCurrentBitNumber += (usValue >> 8); // actual length in upper byte
usValue &= 0xff;                     // symbol value
```
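For completeness, here is a sketch of how the combined 2048-entry table could be built to match that decode, assuming the split point is at 6 leading 1s and short codes are at most 10 bits. The function names and entry packing are my own; a real decoder would derive the codes from the DHT segment:

```c
#include <stdint.h>

// Build the combined table: entries 0-1023 hold the short codes (indexed by
// the top 10 bits of the left-justified 16-bit code) and entries 1024-2047
// hold the long codes (indexed by the 10 bits after the 6 leading 1s, with
// bit 10 of the index forced to 1). Entry = (true length << 8) | symbol.
void build_split_table(uint16_t table[2048], const uint16_t *codes,
                       const uint8_t *lengths, const uint8_t *symbols, int count)
{
    for (int i = 0; i < count; i++) {
        uint32_t left = (uint32_t)codes[i] << (16 - lengths[i]); // left-justify
        uint16_t entry = (uint16_t)((lengths[i] << 8) | symbols[i]);
        if (left >= 0xfc00) {                        // long code: starts with 111111
            uint32_t base = 0x400 | (left & 0x3ff);  // 10 bits after the 1s
            uint32_t span = 1u << (16 - lengths[i]); // duplicates for junk bits
            for (uint32_t j = 0; j < span; j++)
                table[base + j] = entry;
        } else {                                     // short code: at most 10 bits
            uint32_t base = left >> 6;               // top 10 bits
            uint32_t span = 1u << (10 - lengths[i]);
            for (uint32_t j = 0; j < span; j++)
                table[base + j] = entry;
        }
    }
}

// Decode one symbol from the next 16 bits of the stream (MSB first),
// mirroring the single-read lookup in the snippet above.
uint16_t decode_one(const uint16_t table[2048], uint32_t bits16)
{
    if (bits16 >= 0xfc00)             // top 6 bits == 1 -> long half
        return table[bits16 & 0x7ff]; // bit 10 is set, so index >= 1024
    return table[bits16 >> 6];        // short half: top 10 bits
}
```

Because the long half sits right after the short half, the `& 0x7ff` trick lets both cases share one base pointer and one memory read.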

#### Conclusion
This technique is applicable to any Huffman-coded data, such as digital audio, video, images, and file archives (rar/zip files).  The small table size and complete decode in a single read lend themselves well to hardware implementations.  Mobile devices can particularly benefit, since they tend to have small data caches and slow memory.  The ideal split point must be determined on a case-by-case basis, either programmatically or by visual inspection of the code table.