Optimizing access to serial (I2C/SPI) displays
Intro
I've been thinking about writing this post for a while. I decided to do it because I've recently been working on my SPI LCD library and have been more engaged with other engineers on working with serial displays. The purpose of this post is to share some thoughts on getting the best performance from serially connected displays. I've randomly reached out to a few people I saw on Twitter having performance issues and thought it would be better to collect my thoughts here to reach a wider audience.
The Problem
Hobbyists typically buy a serial (I2C/SPI) display for their project, make use of a third-party library to drive it, and wonder why the performance of their software is disappointing. I believe this is due to the following misconception:
"Every pixel takes the same amount of time to draw, right? If I write data to my 240x320 LCD at 40 MHz, I should be able to get 30+ FPS of full screen updates."
Explanation
Serially connected displays use the same basic progressive scan technique to refresh the image. The pixels starting from the top left get updated line by line each frame. The display memory is laid out sequentially and follows the scan order (left to right followed by top to bottom). To change what's displayed, the values stored in this memory must be changed.
The serial interface (I2C/SPI) forces the updates to occur one bit at a time. Each pixel (of color LCDs) is 16 bits, so the most pixels that can be changed per second would be CLOCK_RATE / 16. For example, a 240x320 display with a 40 MHz SPI clock could potentially write this many pixels per second:
40,000,000 Hz / 16 bits per pixel = 2,500,000 pixels per second
There are 240x320 = 76800 pixels on the display, so the maximum frame rate could be:
2,500,000 pixels per second / 76800 pixels per frame = 32.55 frames per second
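The same arithmetic can be written as a minimal sketch in C. The function names are made up for illustration, and the result is only an upper bound: it assumes one bit per SPI clock and counts no command or transaction overhead.

```c
#include <stdint.h>

/* Upper bound on pixels per second: one bit per SPI clock, nothing else counted. */
uint32_t max_pixels_per_second(uint32_t spi_hz, uint32_t bits_per_pixel)
{
    return spi_hz / bits_per_pixel;
}

/* Upper bound on full-screen frames per second for a width x height display. */
double max_full_frame_fps(uint32_t spi_hz, uint32_t bits_per_pixel,
                          uint32_t width, uint32_t height)
{
    return (double)max_pixels_per_second(spi_hz, bits_per_pixel)
           / (double)(width * height);
}
```

Calling max_full_frame_fps(40000000, 16, 240, 320) gives roughly the 32.55 frames per second worked out above.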
This sounds reasonable. With a 40 MHz SPI clock, we could get a decent frame rate. Now suppose we just want to change a small spot that's 16x16 pixels in the middle of the display. If we were to write all of the pixels to the display each frame, we would need to keep them in memory and modify our copy of that memory in order to display the correct content. This creates a new problem. With 76800 16-bit pixels, we would need a memory buffer of 153600 bytes. For Linux, Mac or Windows systems, this is a trivially small amount, but for microcontrollers, this is much more than most have available (e.g. the popular Arduino UNO only has 2K of RAM). Yet, it is possible for these resource-limited systems to drive these displays. How?
The LCD Command Set
The makers of display controllers were well aware that it would be beneficial to not force the microcontroller to have to hold a video buffer which mimics the memory of the display. The commands and features appear to be based on the MIPI Alliance command set. Each manufacturer has variations, but they all follow the same basic rules. In order to affect a relatively large image buffer through a relatively slow serial interface, the commands allow you to specify a small "window" of memory to work with. For example, to change a 16x16 block in the middle of the display (at 120,160), we would use the following (stylized) commands:
Set Horizontal Position + Width (120, 16)
Set Vertical Position + Height (160, 16)
This would set up a memory window of 16x16 pixels in the middle of the display and set the current memory pointer to the upper left pixel of that block. The next 256 (16 x 16) pixels written to the display will be written into that small window in a left-to-right, top-to-bottom order. If you write past the last pixel, it will wrap around to the first pixel and repeat. So these commands allow you to access the display memory in smaller rectangular windows. The default "window" is the whole display, with the memory pointer starting at 0,0.
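On controllers that follow the MIPI DCS command set (the ST7789 and ILI9341 are common examples), the two stylized commands above usually correspond to set_column_address (0x2A) and set_page_address (0x2B). Each takes a 16-bit start and a 16-bit inclusive end, sent high byte first, rather than a position and a width. The sketch below only builds the parameter bytes. The type and function names are hypothetical, and the opcodes should be checked against the datasheet of the controller actually in use.

```c
#include <stdint.h>

/* MIPI DCS opcodes as used by many small LCD controllers (an assumption;
   verify against the datasheet of your display controller). */
enum {
    DCS_SET_COLUMN_ADDRESS = 0x2A,
    DCS_SET_PAGE_ADDRESS   = 0x2B,
    DCS_WRITE_MEMORY_START = 0x2C
};

typedef struct {
    uint8_t column_params[4]; /* parameters for DCS_SET_COLUMN_ADDRESS */
    uint8_t page_params[4];   /* parameters for DCS_SET_PAGE_ADDRESS */
} window_params_t;

/* Convert a start position and a length into start/end bytes, high byte first. */
static void encode_range(uint16_t start, uint16_t length, uint8_t out[4])
{
    uint16_t end = (uint16_t)(start + length - 1); /* end is inclusive */
    out[0] = (uint8_t)(start >> 8);
    out[1] = (uint8_t)(start & 0xFF);
    out[2] = (uint8_t)(end >> 8);
    out[3] = (uint8_t)(end & 0xFF);
}

/* Build the parameter bytes for a w x h window whose top left corner is (x, y). */
window_params_t make_window(uint16_t x, uint16_t y, uint16_t w, uint16_t h)
{
    window_params_t params;
    encode_range(x, w, params.column_params);
    encode_range(y, h, params.page_params);
    return params;
}
```

For the 16x16 block at (120,160), the column parameters come out as 00 78 00 87 and the row parameters as 00 A0 00 AF. After both are sent, a memory write command (0x2C on these controllers) tells the display that pixel data follows.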
Unfortunately, the display controller manufacturers didn't have enough foresight to add some simple ways of accelerating this access, for example run-length encoded commands or higher-level graphics primitives. There are a few LCD controllers that have more advanced 'acceleration' features, but they are the exception.
The Performance Gotcha
So far, nothing I've described suggests that performance would be bad when using these displays. There's a little technical detail which gets in the way of making things run smoothly. The display controller needs to know when you're sending it commands versus pixel data. There are two ways this is done:
3-Wire SPI
This is a variant of SPI in which an extra bit precedes each byte of data and indicates whether it should be interpreted as a command or as data. This means that transmitting each byte requires 9 bits instead of 8. Normal SPI controllers don't have this option and most solutions to talk to these interfaces use bit banging to handle this special case. The following diagram was taken from the datasheet of one of the common display controllers (ST7789) from Sitronix. The bit marked in yellow is the data/command bit which precedes each byte.
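To make the 9-bit framing concrete, here is a minimal sketch that packs bytes into a bit stream the way a bit-banged 3-wire driver would have to clock them out. It assumes the convention used in the ST7789 datasheet (D/C bit first, 0 for a command and 1 for data, then the byte most significant bit first). The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Append one 9-bit word (D/C bit, then the byte MSB first) to a bit stream.
   'out' must be zero-initialised and large enough for the result.
   'bit_pos' is the number of bits already stored; the new count is returned. */
size_t append_9bit_word(uint8_t *out, size_t bit_pos, int is_data, uint8_t value)
{
    uint16_t word = (uint16_t)(((is_data ? 1u : 0u) << 8) | value);
    for (int bit = 8; bit >= 0; bit--) {
        if (word & (1u << bit)) {
            out[bit_pos / 8] |= (uint8_t)(0x80u >> (bit_pos % 8));
        }
        bit_pos++;
    }
    return bit_pos;
}
```

Because every byte grows to 9 bits, the words no longer line up with byte boundaries. That is one reason an ordinary 8-bit SPI peripheral cannot send them directly.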
4-Wire SPI + D/C Bit
This is the more common interface. An extra GPIO line is used to tell the display if it should interpret the incoming bytes as commands or data. In the diagram below, the D/CX line tells the controller how to interpret the next byte transmitted.
So, Where's the Performance Problem?
This little detail of needing to know if the data should be treated as a command or pixels is what gums up the works. The reason this hurts performance is twofold. The first is that the GPIO line used to signal D/C must be set and allowed to "settle" before the SPI transaction begins. The second problem is that each SPI transaction takes time to start and stop - there are buffers that usually get copied, and another GPIO line is toggled (chip select) before and after the data is written. What makes this worse still is that most LCD controllers treat the "window" commands as a combination of both commands and data. This means that setting up little windows to modify a few pixels incurs a bunch of extra delays.
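The overhead can be made visible with a small mock of the bus that counts D/C changes and SPI transactions instead of touching hardware. This is only a sketch: bus_stats_t, bus_write and write_window are hypothetical names, and the opcodes (0x2A, 0x2B, 0x2C) assume a MIPI DCS style controller.

```c
#include <stddef.h>
#include <stdint.h>

/* Counters standing in for real hardware. */
typedef struct {
    unsigned dc_changes;       /* times the D/C GPIO had to change level */
    unsigned spi_transactions; /* separate chip-select framed transfers */
    unsigned bytes_sent;       /* total bytes clocked out */
    int dc_level;              /* 0 = command, 1 = data, -1 = not yet set */
} bus_stats_t;

/* One SPI transaction; a real driver would hand 'bytes' to the SPI peripheral. */
void bus_write(bus_stats_t *bus, int is_data, const uint8_t *bytes, size_t len)
{
    (void)bytes;
    if (bus->dc_level != is_data) {
        bus->dc_level = is_data;
        bus->dc_changes++;
    }
    bus->spi_transactions++;
    bus->bytes_sent += (unsigned)len;
}

/* Set a w x h window at (x, y) and fill it with 16-bit pixels. */
void write_window(bus_stats_t *bus, uint16_t x, uint16_t y,
                  uint16_t w, uint16_t h, const uint8_t *pixel_bytes)
{
    const uint8_t set_columns = 0x2A, set_rows = 0x2B, memory_write = 0x2C;
    const uint16_t x_end = (uint16_t)(x + w - 1);
    const uint16_t y_end = (uint16_t)(y + h - 1);
    const uint8_t columns[4] = { (uint8_t)(x >> 8), (uint8_t)x,
                                 (uint8_t)(x_end >> 8), (uint8_t)x_end };
    const uint8_t rows[4] = { (uint8_t)(y >> 8), (uint8_t)y,
                              (uint8_t)(y_end >> 8), (uint8_t)y_end };

    bus_write(bus, 0, &set_columns, 1);   /* command */
    bus_write(bus, 1, columns, 4);        /* its parameters, sent as data */
    bus_write(bus, 0, &set_rows, 1);      /* command */
    bus_write(bus, 1, rows, 4);           /* its parameters, sent as data */
    bus_write(bus, 0, &memory_write, 1);  /* command */
    bus_write(bus, 1, pixel_bytes, (size_t)w * h * 2); /* the pixels */
}
```

With this model, writing a 3x3 block takes 6 transactions, 6 D/C changes and 29 bytes on the wire, of which only 18 are pixels. Each of those transactions also pays the start/stop cost described above.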
Example
On Twitter I saw a tweet where someone had encountered a performance problem when trying to display a small bitmap scaled 3x larger on an SPI LCD. I'm not trying to shame this person, but it serves as a good example of how seemingly innocuous choices can have dramatic effects on performance. This pseudo code represents how they had written their original algorithm:
uint16_t usColor, *pSrc = pIcon;
for (y=0; y<icon_height; y++) {
    for (x=0; x<icon_width; x++) {
        usColor = *pSrc++;
        // draw a 3x3 box for each pixel
        DrawRectangle(StartX + x * 3, StartY + y * 3, 3, 3, usColor);
    }
}
The code looks perfectly logical and works fine. The only problem is, now that we know how the LCD controller works, we can see the problem with switching back and forth between command and data mode for each expanded pixel. Those 3x3 rectangles don't look like such a good idea after all. Every 9 pixels, the following is happening:
- The pixel boundaries of the rectangle are compared to make sure it is on the visible portion of the display
- 3 commands (set x,cx, set y,cy and then a memory continue command)
- 2 sets of data are sent between the commands to specify the coordinates
- 9 pixels are written to the display memory
The improved version does require a small amount of RAM to hold the expanded pixels, but will run considerably faster:
uint16_t usColor, *d, *pSrc = pIcon;
uint16_t usTemp[icon_width*3];
SetLCDWindow(StartX, StartY, icon_width*3, icon_height*3);
for (y=0; y<icon_height; y++) {
    d = usTemp;
    for (x=0; x<icon_width; x++) {
        usColor = *pSrc++;
        // expand the pixels by 3x horizontally
        *d++ = usColor; *d++ = usColor; *d++ = usColor;
    }
    // Write 3 identical lines of our expanded pixels
    writePixels(usTemp, icon_width*3);
    writePixels(usTemp, icon_width*3);
    writePixels(usTemp, icon_width*3);
}
Notice the main difference? The improved code only sends 1 set of commands to set the LCD window before the start of the loop. After that, the pixel data is written in fewer SPI transactions. If your system has enough memory, an even faster version would be to prepare all of the pixels in RAM and write them in a single transaction to the display.
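For the all-in-RAM variant, a minimal sketch could expand the whole image into one buffer first. The function name scale_image is hypothetical, and the caller is assumed to provide a destination buffer large enough for the scaled image.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand a src_w x src_h image of 16-bit pixels by an integer factor.
   'dst' must hold (src_w * scale) * (src_h * scale) pixels. */
void scale_image(const uint16_t *src, size_t src_w, size_t src_h,
                 size_t scale, uint16_t *dst)
{
    size_t dst_w = src_w * scale;
    for (size_t y = 0; y < src_h * scale; y++) {
        const uint16_t *src_row = src + (y / scale) * src_w; /* source line */
        uint16_t *dst_row = dst + y * dst_w;
        for (size_t x = 0; x < dst_w; x++) {
            dst_row[x] = src_row[x / scale]; /* repeat each pixel 'scale' times */
        }
    }
}
```

After the buffer is filled, one SetLCDWindow() call followed by a single writePixels() call covering all icon_width*3 * icon_height*3 pixels would send everything in one transaction. The cost is RAM: 9 times the size of the original icon.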
If you're working with Arduino boards (original AVR MCUs), the SPI bus can only go at 8 or 4 MHz (half the processor speed). Even in this case, you can still do some impressive things with these displays if you don't need to change every pixel every frame. I reference AVR MCUs because their SPI speed is very limited for doing video updates. You are in a similar situation if the CPU and SPI peripheral are fast but the display's SPI bus only runs slowly. The video below is a demo of my Pac-Man emulator running on a Raspberry Pi Zero and driving a 320x480 display at 12 MHz (relatively slow). My emulator takes advantage of the fact that Pac-Man doesn't change much of the display each frame, and I optimized my code to write only the changed pixels. This allows 60 FPS animation on a relatively slow SPI LCD.
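One way to find the changed pixels is to keep a copy of the previous frame and scan each line for runs that differ. The sketch below illustrates that idea only. It is not taken from the emulator, and span_t and next_changed_span are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t start;  /* x position of the first changed pixel in the run */
    size_t length; /* number of consecutive changed pixels */
} span_t;

/* Find the next run of pixels that differ between 'prev' and 'cur',
   searching one line of 'width' pixels starting at x = 'from'.
   Returns 1 and fills 'span' if a run was found, otherwise returns 0. */
int next_changed_span(const uint16_t *prev, const uint16_t *cur,
                      size_t width, size_t from, span_t *span)
{
    size_t x = from;
    while (x < width && prev[x] == cur[x]) {
        x++; /* skip unchanged pixels */
    }
    if (x == width) {
        return 0;
    }
    span->start = x;
    while (x < width && prev[x] != cur[x]) {
        x++; /* extend the run while pixels differ */
    }
    span->length = x - span->start;
    return 1;
}
```

Each span would become a one-line window followed by one pixel write. Since every window change has a fixed cost, merging spans that are separated by only a few unchanged pixels is likely to be faster than sending them separately.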
Example 2
This video shows the time needed to draw diagonal lines across the display using direct rendering (data is sent to the display while drawing each line) versus indirect rendering (pixels are drawn into a local buffer, which is then written to the display in one shot). A diagonal line (slope 1) is a worst-case scenario for SPI displays, since only one pixel can be drawn at a time before SetLCDWindow() must be called again. There's a 4 second pause between each type so that you can see the time spent (1375 ms versus 41 ms).
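The indirect approach can be sketched as an ordinary Bresenham line drawn into a local buffer. This is a generic illustration rather than the code used in the video, and draw_line_buffered is a hypothetical name.

```c
#include <stdint.h>
#include <stdlib.h>

/* Draw a line into a local buffer that is 'width' pixels wide, using
   Bresenham's algorithm. Both end points must lie inside the buffer.
   Returns the number of pixels plotted. */
unsigned draw_line_buffered(uint16_t *buf, int width,
                            int x0, int y0, int x1, int y1, uint16_t color)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    unsigned plotted = 0;

    for (;;) {
        buf[y0 * width + x0] = color; /* plain memory write, no bus traffic */
        plotted++;
        if (x0 == x1 && y0 == y1) {
            break;
        }
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
    return plotted;
}
```

For a slope 1 line, the returned count is also the number of separate window setups that direct rendering would need, because no two pixels share a row. With the buffered version, the whole buffer goes to the display with one window setup and one write.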
Takeaways
- The achievable frame rate of serially connected displays depends on more than just the SPI clock rate
- Plan ahead to minimize the number of output window changes
- Write as many pixels as possible in each SPI transaction
- I haven't specifically referenced I2C displays in this post, but the same ideas apply
Hopefully this will help you get the most performance out of your serial displays.
Excellent write-up as usual. I'm curious what the performance for something like this would be for an ARM cortex-m0 device like say, the STM32F0x0 series of micro-controllers from STMicroelectronics, "With 76800 16-bit pixels, we would need a memory buffer of 153600 bytes...If you're working with Arduino boards (original AVR MCUs), the SPI bus can only go at 8 or 4Mhz (half the processor speed)"; the high end model can deliver 48 MHz CPU and 256 K of RAM.
You can do quite a bit with these displays if you're running at 48 MHz and have plenty of RAM. In fact, I'm updating my SPI_LCD Arduino library to support a back buffer (mostly targeting the ESP32) to allow delayed rendering and more complex graphics operations.