A 'Low Memory' GIF decoder
I just released my AnimatedGIF library for Arduino and it contains a lot of optimizations and workarounds to perform well on microcontrollers. I thought it might be useful to document what makes it different from other GIF decoders out there. First a little background on GIF images:
GIF (Graphics Interchange Format)
Compuserve was an early internet service provider and they created an image format for streaming compressed pictures over a slow and unreliable channel (dial-up internet). Their first format (GIF87) utilized LZW compression for transmitting a single image. LZW compression is a clever way of finding and compressing repeated sequences of symbols and outputting them in a (VLC) variable-length-code stream. The GIF format treats images as a continuous (one dimensional) stream of pixels. By doing it that way, it misses out on taking advantage of horizontal and vertical symmetries. The PNG file format compresses image more effectively not just because Deflate/Inflate compression is slightly better, but also because it uses various filters to take advantage of 2D symmetry. There's nothing stopping someone from using these same filters with LZW compression, but the die has been cast and we're stuck with the standards as they are. The file format evolved to allow animated sequences of images (GIF89). A few years after the GIF file format had gained popularity on the new "World Wide Web", a new problem emerged. The LZW compression used in GIF was protected by a patent owned by Unisys Corporation. This wasn't widely known when GIF was gaining use. The existence of the patent caused a great stir when Unisys announced it would enforce it long after the GIF format had taken hold and had become the de-facto standard for website graphics. Many lawsuits were opened to try to collect royalties on the use of LZW compressed images and on the code which viewed/generated them. Many companies complied and others stopped using them or tried to transition to other image formats. The patent(s) relevant to GIF all expired in 2004.
GIF on Embedded Boards?
The code required to parse the GIF file and decode the LZW stream is not terribly complicated, but it presents some challenges to run it on small memory devices. The LZW algorithm needs to keep a 'dictionary' of symbols and pixels as it decodes the data. For GIF, the maximum code size is 12-bits, so the dictionary needs to hold 4096 entries. In my decoder, I use an array of uint16_t integers for the codes and two arrays of uint8_t for the pixel runs, so this takes a total of 16K of RAM. I also need some buffers for reading data from the GIF file, holding a row of pixels and holding up to 2 color palettes. The total comes to around 22.5K. In order to make it safer/easier to run on embedded systems, I don't use malloc/free, so my GIFIMAGE structure has all of the buffers statically defined. This allows the user to instantiate the class statically or use 'new' if they want to dynamically create a copy of the class. The code runs successfully on a Cortex-M0+ with 32K of RAM, so the stack and other system variables all fit within the 32K. I assumed when writing this code that the display device would be no larger than 320x480 pixels since this will like run on Arduino compatible boards using low cost SPI LCD displays. The other challenge to decoding GIFs on a small memory device is described below.
An animated GIF contains the canvas size (overall image size) and each frame of the animation can be that size or smaller - it only needs to contain the pixels which changed. With transparent pixels and frame size adjustments, the size of an animated GIF can be reduced compared to just compressing each frame independently. This also means that the frame sequence must always be played sequentially; you can't assume that there are "key frames" to allow you to quickly jump ahead or backwards.
One of the parameters for each frame is the "disposal method". This tells you how to "dispose" of the frame after it's displayed, in preparation for the next one. For most GIF files I've encountered, this is defined as "do nothing". It can also be "revert to background color". This is easy enough to do. The difficult option is "revert to previous". This option would require the decoder to keep a copy of the previous frame image (in its entirety) to "undo" the current frame's changes in preparation for the next frame. Here's how it looks on a time line:
Revert to background
<decode + display frame N> <repaint frame N's area as bkgd color> <decode + display frame N+1>
Revert to previous
<decode + display frame N> <repaint frame N's area as frame N-1> <decode + display frame N+1>
Implementing the "revert to previous" option isn't very complicated, but it presents a major hurdle for a low-memory environment. It would require the decoder to have a complete copy of the previous frame. If we're displaying a 320x240 image sequence, that's 76800 bytes. One of my goals for this project is to have it run successfully on low-RAM MCUs like the Cortex-M0+ (32K RAM). It's not possible to support every GIF display option using embedded boards since they typically don't have very much RAM.
Decoding GIF images isn't very compute-heavy, but there are still challenges to making it run quickly on a humble microcontroller. There are 3 main challenges to playing animated sequences quickly:
- Reading the image data (from disk or memory)
- Decoding the LZW codes into pixels
- Displaying the pixels
Let's examine each of them in order.
Reading GIF Data
This shouldn't present much of a challenge, but there are some gotchas here too. My library supports reading GIF image data from any source. This is done by allowing the user to provide callback functions for open/close/read/seek. The callback code to read from memory is built into the library and example callback functions for SD cards are given in the example sketches. Reading from memory (FLASH/RAM/ROM) doesn't present any problems, but reading from SD cards gave me some unpleasant surprises when I was testing the code. The first surprise was on the Teensy 4.1. I found that if I read a file up to and including the last byte, the seek() function no longer worked. This prevented the file from looping without closing and re-opening it. The other unpleasant surprise (true on most embedded SD card library implementations) is that seeking can take a very very very long time (e.g. > 500ms). Part of my strategy for speeding up access to external media is to reduce the total number of read() calls by reading relatively large chunks of data. The problem with this is that my old logic was using seek() to reset the file position if I read past the end of the data than I needed. I rewrote the logic to avoid seeking by using this extra data instead. There is still the need to seek in some cases, but it's very rare.
GIF files' use of the LZW algorithm generates variable length code words from 3 to 12 bits each. Since the codes are not conveniently sized in terms of bytes, it's necessary to shift/mask the incoming data to extract the codes. Another layer of complexity is that the data is not just laid out in an unending stream, but broken up into 'chunks' of 1 to 255 bytes each. This was probably done to recover from errors in transmission in the old days, but it just slows things down today. The chunks are specified as a 1-byte length followed by that many LZW bytes. The chunk data of the frame ends with a chunk length of 0. There are multiple ways to handle the VLC decoding situation, some slow, some fast:
Worst - Read a byte at a time from the input stream to have enough data for the current code; test for reaching the chunk boundary each time
Better - Read 32/64 bits of source data into a variable, extract multiple VLC codes from that and check the chunk boundaries when refreshing the 32/64-bit variable
Best - Read and de-chunk enough source data into a buffer so that you can decode an entire line without worrying about chunk boundaries. Keep the data in a 32/64-bit variable so that you can extract multiple codes without re-reading memory.
As you can probably guess, my implementation uses the Best strategy from above. The reason I put emphasis on this code is because the inner loop (aka code that executes the most frequently) must be as efficient as possible. For GIF decoding, that code is the VLC extract/decode loop. That inner loop must shift/mask the incoming data, but if you can avoid testing the chunk boundaries every time you'll execute a lot quicker. In my code I read up to 6 chunks of 255 bytes into a buffer and test if it needs to de-chunk more data once every 4 lines of output image. On my PC version of this code, I de-chunk the entire frame before entering the decode loop. Since RAM is a precious resource on embedded CPUs, a 1530 byte buffer seemed reasonable.
The pixels encoded in GIF files are 1 to 8 bits each and use a palette of 24-bit RGB values to translate them into the final display color. Every pixel must be translated through the palette to display each frame, so it's important that the code is efficient. Since this library was designed to display images on low cost LCDs which typically use RGB565 pixels, I translate the 24-bit palette entries into RGB565 values at the start of each frame when the palette is defined. That way, each pixel only needs a single table lookup to be translated into the form needed by the LCD controller.
On PCs and some mobile or embedded devices, the display is rendered directly from memory to an LCD or HDMI video output. In these cases, to change what's being displayed, it's simply a matter of writing to the correct spot in memory and the next time the frame is refreshed, the new content will show up. Writing to this memory doesn't normally add any additional delays compared to writing to regular RAM. In the world of Arduino/embedded devices, displaying graphics is usually done in a less convenient way with LCD displays connected through I2C or SPI serial links. I wrote another blog post about optimizing access to these types of displays, but I'll summarize it here. It's time consuming to transmit pixel data over a serial link. The speed of the link determines how fast you can send those pixels and if you don't have a DMA controller, the CPU must wait for the write to complete. I discovered just how slow this can be with the Arduino Nano 33 BLE board. The CPU has enough RAM and is plenty fast, but the SPI bus can only run at a maximum of 8Mhz. That may sound pretty fast, but most of the popular SPI LCD controllers can handle 60-100Mhz. Here's what 8Mhz (no DMA) looks like on the Nano 33:
And here's a slower processor (48Mhz Cortex-M0+) decoding a larger image with faster SPI (24Mhz and DMA):
I hope this article helps clarify what went into making an optimized animated GIF player for Arduino/embedded. Feel free to ask questions in the comments if something is still unclear.
To follow my daily project activities, the best place to look is on my Twitter Feed.