Code Optimization Lesson - Simplification

I've been wanting to create a series of lessons on code optimization and something I encountered yesterday seemed like a good starting point. There are many ways to get the same results in software; not all are efficient. Optimizing code is one of those skills that any programmer can do, it just takes a little extra experience and patience. In this lesson I'll walk you through the thought process of how I analyze code followed by the steps to improve it. I will attempt to show you how code looks "through my eyes". We're going to look at part of the font rendering code within LVGL. By optimizing this code, I'm not intending this lesson to be a criticism of that project, it's just a good example of code I analyze. Here's the code in question: (LVGL Github link)

The purpose of this type of optimization is to improve the C code for all target systems, not to slice 1 clock cycle off of register reads.

A 'normal' C programmer will look at this code and see/assume the following:

  • A loop to expand 4-bit anti-aliased font data (individual characters) to 8-bit grayscale
  • Uses a 16-entry lookup table to convert each 4-bit value to 8-bits (e.g. 0x05 -> 0x55)
  • X/Y loops are necessary because the output stride is different from the character width (box_w)
  • Inner loop conditional test of even/odd is needed since character widths can be odd

Here's what I see when I look at that same code:


A) Using a member variable through a pointer as the loop count can cause the compiler to re-read the value each time through the loop since most situations treat indirect pointers to member variables as volatile.
B) An extra variable in the loop (i) to test even/odd pixels may not be needed
C) A separate AND operation to maintain the value of i as 0/1 is not needed since the low bit of the counter can be tested in the conditional statement. Also the extra compare (i ==1) is unnecessary since it can only have the value 0 or 1.
D) Reading the same source byte twice (even / odd passes) can be more expensive compared to reading it once and keeping the value in a local variable.
E) Are the fonts really byte aligned per row or is this an unused option?
F) Is the output stride different from character width?

What do all of these thoughts have in common?

  • Test every assumption
  • Minimize variable count in the inner loop
  • Minimize dereferencing of variables accessed through a pointer
  • Make best use of the conditions you create (e.g. testing the value of i)
  • Minimize branching within the inner loop

How to proceed?

I tested the stride and aligned situations and found that they are not used/needed for the fonts provided with LVGL. This freed me to think about a better lookup table. My first attempt at an optimized loop worked correctly:

How did I get here?

  • Created a larger lookup table (512 bytes vs 16) which takes a byte as input (2x 4-bit pixels) and converts it to a 16-bit value (2x 8-bit pixels). For example, the value 0x13 becomes 0x3311 (the most significant nibble represents the left pixel of each pair).
  • Simplified the loop and removed the 2 conditional statements in the inner X loop. Since the stride variable is always equal to the character width, I was able to collapse 2 loops into 1.
  • Moved the count variable into a local (register) variable instead of dereferencing it through structure member variables.

What are the risks with the optimized code?

  • The output pointer (bitmap_out_tmp) may not be 2-byte aligned, so on some systems (e.g. Cortex-M0) this may cause an exception. Unlikely, but possible.
  • The output may overwrite the buffer by one byte if the product of X * Y is odd. This is easily remedied by making the output buffer 1 byte larger than needed.
  • Some user-provided fonts may break the "byte_aligned" rule and this would cause odd-width characters to be converted incorrectly.

This coding example touches on a small set of ideas that I use for optimizing C/C++ code. The main idea I hope you take away from this article is to test every assumption. In order to write software, one needs to balance a ton of small bits of information in your head. Much of this information is assumptions about how the compiler treats the code and how computers work internally. Even with many years of programming experience, you may have incorrect assumptions about these details. If you're analyzing someone else's code, you may assume that choices made were the correct/best ones. It's hard to break this habit, but it's necessary in order 'see beyond' what you see.

I created an example Arduino sketch to measure the speed difference. The link to the Github gist is here.

The results when running 10,000 iterations of a 16x16 character on an ESP32-S3:

Testing original code
Original code: 10000 iterations, total time = 230 milliseconds
Testing new code
New code: 10000 iterations, total time = 71 milliseconds


Comments

Popular posts from this blog

Surprise! ESP32-S3 has (a few) SIMD instructions

How much current do OLED displays use?

Fast SSD1306 OLED drawing with I2C bit banging