Fast SSD1306 OLED drawing with I2C bit banging

What

The SSD1306 OLED displays are very popular with hobbyists due to their low cost and easy interfacing. The majority of the ones sold expose a two wire interface (TWI) aka I2C. The default speed for I2C is 100 kHz and the "fast" mode is 400 kHz. These are the 2 standard speeds supported by most AVR Arduinos. An I2C clock rate of around 800 kHz is also possible on AVR MCUs, but not supported directly by the Wire library. The I2C standard also defines higher speeds (1 MHz Fast-mode Plus and 3.4 MHz High-speed mode). The 3.4 MHz version uses a slightly different protocol. At 400 kHz, using the I2C hardware and the Wire library, I was able to refresh the display around 23.5 frames per second (FPS) with my code.

Why
I have already written an SSD1306 library for both Linux and Arduino, but I wanted to drive the display from an ATtiny85 and learn about the I2C protocol in the process. The ATtiny85 doesn't have I2C hardware built in, so it needs to be emulated in software using GPIO pins. There are several public domain libraries available (e.g. TinyWireM), but I wanted to see how much code is necessary to talk to a write-only I2C device and how well I could optimize it. The SSD1306 OLED controller also supports a 10 MHz SPI interface, so I assumed that the I2C interface on these displays could probably be driven faster than "spec" and not have any major issues. The code I created is not necessarily practical, nor the 'right' way to do it, but I wanted to see how fast I could get it in C/C++ without having to write it in AVR assembly language.

Caution: For this experiment, I'm running the AVR at 16 MHz with a Vcc of 4.5V. I connected the GPIO lines and Vcc directly to the SSD1306. I've seen info indicating that they're meant to run at only 3.3V, and other info showing that they're safe from 3.3V to 5V. Proceed at your own risk. If you run this code on an AVR pre-configured for 8 MHz and 3.3V, expect about half the frame rates I measured.

How
I grabbed a copy of the I2C protocol specification (Rev 6, April 4, 2014) which is apparently owned by NXP Semiconductors. The condensed version is that there is typically one master and one or more slave devices on the bus (data + clock lines, aka SDA + SCL). The signal lines are normally pulled up to VCC and in tri-state (high impedance). When the master wants to begin a transaction, it sets the lines as output signals and follows the protocol. There is an acknowledge bit that gets sent back from the slave to the master after each byte is sent to signal that it was received successfully. I was curious if this could be ignored and for the SSD1306, it doesn't seem to care. This meant that I could leave the SDA and SCL lines as outputs the whole time I was writing data. Before anyone starts to complain that I'm not following the spec, for this project I'm not interested in creating a 100% compliant I2C protocol emulator, I just want to see how fast I can push the SSD1306 by bit-banging the data into the I2C pins.

First Try
For my first pass, I followed the I2C spec precisely and used the pinMode() and digitalWrite() functions for a functional baseline. As a coder, you don't want too many unknowns to have to debug, so I usually start with the simplest code to get it working. Surprisingly, the code worked the first time and resulted in a display refresh speed of 5.5 FPS. The clock frequency I'm generating varies from byte to byte and bit to bit, but I2C is very forgiving as long as the data is stable during the clock transitions. The speed is not impressive, but that's not a deterrent because I know that those GPIO access methods are slow. A little background - the AVR MCUs come in a variety of configurations and the GPIO ports are mapped to the pins differently depending on the chip. The pinMode and digitalWrite / digitalRead functions hide those differences by referencing everything as a physical pin number. This makes it easier to port your software from an ATmega328 to an ATtiny85. The downside to using those functions is that they do a bit more than just translate the pin numbers and this causes poor performance. The alternative way to access GPIO on AVRs is to reference the PORT (digital output) and DDR (data direction) registers directly. This makes the code less readable to people unfamiliar with AVR MCUs, but it is necessary in order to gain the speed.

Second Try
Since my I2C protocol code was working with the slower access method, the next step was to convert it to talk directly to the GPIO ports of the AVR. The AVR MCU has unique instructions to speed up access to I/O ports, so setting or clearing a bit (setting a pin to a high or low level) of a GPIO port can be done with a single instruction (sbi/cbi), which takes two clock cycles on the classic AVR cores used here. After replacing all of the I/O methods, the code was now able to refresh the display at 86.5 FPS. This is an impressive speed, but not unique. I've seen existing code on GitHub which looks similar and probably performs about the same.

Final Steps
This is the part where I get creative and go beyond the "usual" ideas to get the maximum speed. My first instinct is to check that the compiler is doing a good job with my code. The default compiler flags for the Arduino IDE include "-Os". This enables most of the -O2 (optimization level 2) passes but favors smaller code. To see what the compiler is generating, I found it easiest to use the avr-objdump tool. This is one of several AVR binary tools. Instructions for installing them can be found here. One of the things I noticed was that the code being generated for an if/then/else statement wasn't as efficient as just setting the else condition by default, then checking for the "if". Here's the before and after:

This code is in the inner loop shifting the bits out to the SDA line. The before version:


Here's the one which generates faster code:

More advanced compilers would generate the same output for either set of statements, but on the AVR, we have to manually nudge the C compiler to generate the fastest output.

I also noticed that my "inline" modifiers on static functions were being ignored. This may be due to the -Os option trying to make the code small. I eventually brute forced my inline code, but let's explore some other avenues first.

The I2C standard says that after transmitting each byte, the master releases the SDA line (tri-state) and sends an additional clock cycle to receive an acknowledge (ACK) bit from the slave device. This ACK bit (if zero) indicates that the byte was received successfully and another can be sent. The SSD1306 is designed such that you can keep sending it data bytes forever because the address wraps around when it reaches the end of the internal buffer. This should mean that we can skip checking the ACK bit. I tested this theory by setting SDA low, but not changing it to tri-state. This worked reliably, but is another area where I'm 'breaking' the spec. I also tried leaving SDA at whatever value it had at the end of each byte (equally likely to be high or low) and that resulted in occasional failures. Removing the code that changes the pin direction (DDR register) saves a few more cycles per byte. At this point, the refresh rate is about 90 FPS.

The code I designed for this project assumes that the SDA and SCL pins are controlled by the same AVR port. This is a reasonable assumption for the ATtiny85 since it only has 1 (PORTB). By making this assumption, I can save a bit more time by combining some of the SDA + SCL operations into a single logical OR/AND and have a byte register variable hold temporary results to avoid repeated operations. For example, I can pre-read the current state of the PORT into a byte variable. This preserves the state of the other pins controlled by it. If I clear the bits I'm using for SDA and SCL in that copy, then I no longer have to re-read the PORT every time I want to set or clear a bit (the difference between read-modify-write and modify-write). This difference, writing a value from an MCU register to a port versus using a read-modify-write operation (e.g. PORT |= value), has a significant effect on performance. Here's an example of how I use that to my advantage in my byte shifting function:


The bOld variable is helpful to simplify the code which toggles the clock bit without disturbing the other GPIO lines controlled by that port. At this point I've gotten above 100 FPS, but there's still more to do. One of the things I've learned from working with data compression is that the shortest code path should be for the most probable symbol. The byte values 0x00 and 0xFF are very probable in image data and share a unique property: all of their bits are identical. This is a time-saving property in this case because it means that the SDA line doesn't have to change while the SCL line is toggled. Adding the conditional statement does use a couple of extra clock cycles, but it's overshadowed by the savings of transmitting these frequently occurring bytes. Here's the inner loop of that byte transmit function with the new check:


With this final change (and brute force inlining of this code), the screen refresh is > 150 FPS on an ATMega32u4 and > 140 FPS on an ATtiny85. I looked at the final output of the compiler and there is still some room for improvement, but only if I write it in AVR assembly language. I may play with that at a future date, but for now, my work is done 😃. You can get the Arduino project code here

Comments

  1. May you please show an example as video? Maybe some jumping balls etc.

    1. I agree that a good graphics demo would help show the benefits of this code. Since the display memory can be updated faster than the controller can display it, I think something like an optimized line drawing function would be a good way to show off the speed. I'll work on it when I have some free time.

    2. I added an optimized Bresenham line drawing function and recorded a video of it. Do you think it deserves a new blog entry to describe how I optimized the line drawing? https://www.youtube.com/watch?v=aQxOtyEr6eQ

    3. I'd personally love to see it. This was an awesome read, and I learned some stuff about graphics and optimization from your posts~

  2. man, awesome work! over 10fps is exciting on these oleds! I tried to get it to work for me (UNO/328P) but with this code the oled (128x64 SSD1306) won't turn on. Other libs (u8x8 and SSD1306Ascii) work fine.
    SCL is connected to A5 and SDA to A4.
    changes I've made to the code:
    oled_addr = 0x7B (as printed on the oled's PCB)
    I2CPORT PORTC
    I2CDDR DDRC
    BB_SDA 4
    BB_SCL 5

    1. The OLED address is either 0x3C or 0x3D. What you see as 0x7B is most likely 0x78. I2C addresses are 7 bits (0-127); on the wire they occupy the upper 7 bits of the first byte, with the lowest bit being the R/W bit. The 0x78 printed on the board is that 8-bit form, so shifting it to the right by 1 bit (aka dividing by 2) gives the real address, 0x3C.

    2. I hit this problem too, on exactly the kind of board that has markings for two addresses at the bottom. It just does not turn on, for reasons unknown to me. Then I happened to have another PCB screen and BAM, it booted up without problems. With other libs, the OLED that doesn't work here works fine though.

  3. Great stuff Larry - I will definitely have to give your code a go! :)

  4. I've had a play with your code for a project I am working on, and your code is really great! :) I played around with sbi and cbi for some extra optimisation fun too. I noticed that there is a small typo in i2cend(), a "~" is missing in the statement regarding I2CDDR. (I have other things on the same PORT and noticed some strange behaviour when this code was run, as you can imagine). The "~" fixed it nicely though. I'll let you know when I post my blog article on my project involving your fast SSD1306 code if you are interested :)

    1. Glad you found it useful. I'll take a look at the error you spotted and push an update.

  5. That the SSD1306 doesn't care about the ACK bit is probably by design. I found a datasheet of a 72x40 LCD module with SSD1306, and if you look at the example I2C code on page 23, you'll see almost an exact (albeit less efficient) copy of your code.

    http://www.icbanq.com/data/ICBShop/board/ZJY001_0.42_16P%20OLED.pdf

  6. Two comments about the interface.
    1 if you set the output pins to low and switch the direction bit you get the correct open collector configuration needed to read slave response, assuming you want to read it.
    2 bus speed is a function of supply voltage and resistive/capacitive loading. Reducing bus voltage to 3.3v should support higher bus rates. Slower CPU clock will reduce speed.
    I wonder if you could use a little hardware and an interrupt to detect presence /lack of the ack?
    It's great to see someone hacking at this level.
    I did a system with sw i2c slave, sw uart, and sw 1wire on an msp430.
    That was interesting.

    Jim

  7. Do you know what the max frame rate is? Using your cool ideas it's possible to bitbang some pixels at hundreds of FPS...but that's ultimately just the vram update rate, not the display's scan-out rate. The specs show a FR pin for tear-free sync, but it's not brought out in a way that I am able to measure with a logic analyzer, let alone bring to the AVR to use.

    1. There's a section in the datasheet which describes how to set the clock divider for determining the frame rate. I think it can go as high as 100FPS. The real problem is synchronizing updates with the redraw - there's no vblank signal to know when the display is restarting from the top.

  8. Really great work!!!
    Congratulations!
    It runs perfectly in turbo!!!
    Thanks

  9. masterpiece!
    I want to port this program to SAMD Arduino. I don't have the skills yet.
    I think all I have to do is rewrite the port operations...
