### Optimizing Embedded C - The Median Filter

Optimizing code for embedded CPUs shares many of the same rules as optimizing for desktop and mobile CPUs, but there are some important differences. In this post, I'm going to walk through optimizing an image filter on an ARM Cortex-M4/M7 processor. The specific code in this case comes from my work with the OpenMV project and optimizing their median filter. For those unfamiliar with their work, OpenMV has created a series of ever more powerful embedded CPU boards with integrated cameras for 'smart camera' IoT applications. The value they bring to the market is not only mating powerful embedded CPUs and advanced camera modules, but a rich software ecosystem. It has a large collection of useful native code primitives written in C, accessed through a Python front end. All of it is controlled from a slick IDE that makes experimentation and deployment painless. I started working with OpenMV a few months ago to optimize their time-critical functions. This week I've been working on their image filters.

The median filter works by analyzing the neighboring pixels and choosing the statistical median value. It scans the original image from top to bottom, left to right, and creates a new image made up of these median values. The filter 'kernel' can be an odd-sized square of pixels (e.g. 3x3, 5x5, 7x7, etc). Some useful visual properties emerge when the filter size is 7x7.
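Before looking at the real code, here's a tiny standalone illustration of the idea (my own sketch, not OpenMV's code): gather the neighborhood samples, sort them, and take the middle value.

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of uint8_t pixels. */
static int cmp_u8(const void *a, const void *b)
{
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

/* Median of a 3x3 neighborhood: sort the 9 samples, take the middle one. */
static uint8_t median9(const uint8_t p[9])
{
    uint8_t tmp[9];
    for (int i = 0; i < 9; i++) tmp[i] = p[i];
    qsort(tmp, 9, sizeof(uint8_t), cmp_u8);
    return tmp[4]; /* middle element of the sorted list */
}
```

Note how a single outlier (a very bright or dark pixel) never reaches the output; that's what makes the median filter good at removing salt-and-pepper noise.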

The figure above is an example of a 3x3 median filter. The default implementation is to gather the neighboring pixels around the target pixel, sort them in ascending order, then choose the middle value from the sorted list. In this case, 9 pixels must be gathered and sorted for each pixel in the original image. As the filter kernel grows, the number of pixels to gather and sort grows quadratically with the kernel width. With a 7x7 kernel, 49 pixels must be processed for each source pixel. Below is the original median filter code from OpenMV for grayscale images (some code has been changed to simplify the view):

```c
int n = ((ksize*2)+1)*((ksize*2)+1);
int *data = malloc(n*sizeof(int));

for (int y = 0; y < img->h; y++) {
    uint8_t *buf_row_ptr = DEST_ROW_PTR(y);

    for (int x = 0; x < img->w; x++) {
        int *data_ptr = data;
        for (int j = -ksize; j <= ksize; j++) {
            int yy = MIN(MAX(y + j, 0), (img->h - 1));
            uint8_t *k_row_ptr = SRC_ROW_PTR(yy);

            for (int k = -ksize; k <= ksize; k++) {
                int xx = MIN(MAX(x + k, 0), (img->w - 1));
                *data_ptr++ = PIXEL(k_row_ptr, xx);
            }
        }
        fsort(data, n);        // sort the pixels
        int pixel = data[n/2]; // take the middle value
        IMAGE_PUT_PIXEL(buf_row_ptr, x, pixel);
    } // for x
} // for y
free(data);
```

I tested this filter (7x7) in the OpenMV IDE on a QVGA grayscale image and observed a frame rate of 1.8 frames per second. At a lower resolution (e.g. QQVGA) the frame rate improves, but the filter becomes less useful on such a small image. This is why they asked me to take a look at this code: better performance would allow the filter to be used in more real-time applications.

One of the big differences between optimizing for embedded CPUs and for desktop or mobile CPUs is the speed of memory. Most MCUs with integrated flash and static RAM match the memory speed to the processor speed, so memory can be read or written in a single clock cycle. This is a big difference from desktop and mobile CPUs, where DRAM can be hundreds of times slower than the CPU clock and various cache memory mechanisms are used to mitigate that delay. Since memory isn't the bottleneck on our target CPU, the only goal is to reduce the total instruction count.

My thoughts when looking at the code above:
• Try to reuse pixels in the filter kernel from the previous x value
• Look for alternatives to sorting the pixels
• Remove the boundary checks from the inner loop
• Check for special/SIMD instructions to accelerate the work

After doing some searching, I discovered an alternative to finding the median through sorting: a histogram can accomplish the same thing. Build a histogram of the frequency of the pixels in the kernel; the median is the bin where the running sum of the frequencies crosses the halfway point. This also makes it much easier to re-use pixels from previous kernels as we sweep across the image. Moving from left to right across the source image, the old pixels on the left edge of the previous kernel are removed from the histogram and the new pixels on the right edge are added. This reduces the work per source pixel from the full kernel area ((2*ksize+1)^2 reads) to just two kernel columns (2*(2*ksize+1) reads).

This type of optimization is known as a "sliding window histogram". One potential problem with the idea is that 8-bit grayscale data produces a histogram with 256 "bins"; summing that many elements per pixel would undo the gains we made by removing the sort. I thought of two remedies: reduce the sample resolution from 8 to 6 bits, and use a SIMD instruction to sum 4 bins at once.
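To make the histogram idea concrete, here's a minimal, portable sketch (my own, without the SIMD part) that finds the median bin from a 64-bin histogram. The bin counts are assumed to sum to the kernel population (49 for a 7x7 kernel):

```c
#include <stdint.h>

#define NUM_BINS 64 /* 8-bit pixels quantized to 6 bits: 256 >> 2 */

/* Walk the histogram, accumulating bin counts until we pass the halfway
   point. With cutoff = n/2 (n odd), the 0-based median index is n/2, so
   the first bin where the running sum exceeds cutoff holds the median. */
static int scalar_hist_median(const uint16_t hist[NUM_BINS], int cutoff)
{
    int sum = 0;
    for (int i = 0; i < NUM_BINS; i++) {
        sum += hist[i];
        if (sum > cutoff)
            return i; /* 6-bit median bin; shift left by 2 for 8-bit */
    }
    return NUM_BINS - 1;
}
```

This scalar loop can touch up to 64 bins per pixel, which is exactly the cost the SIMD version below is designed to cut down.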

Here's my histogram search function. The length in our example will be 64 and the cutoff value will be 1/2 of the number of elements in the kernel (49/2 for a 7x7 kernel). The __USADA8 instruction sums the absolute difference between two sets of 4 uint8_t elements and accumulates the result in another register. We don't need the absolute difference between values, but we can take advantage of the horizontal sum of 4 uint8_t values by differencing them against a value of 0. This allows our loop to produce the median with a maximum of 16 iterations per source pixel (64 / 4).

```c
static uint8_t hist_median(uint8_t *data, int len, const int cutoff)
{
    int i;
    uint32_t oldsum = 0, sum = 0;

    for (i = 0; i < len; i += 4) { // work 4 at a time with SIMD
        sum = __USADA8(*(uint32_t *)&data[i], 0, sum);
        if (sum >= cutoff) { // the median is within this group of 4
            while (oldsum < cutoff && i < len)
                oldsum += data[i++];
            break;
        }
        oldsum = sum; // keep the running sum before we pass the cutoff
    } // for each group of 4 elements
    return i - 1;
} /* hist_median() */
```
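__USADA8 is a Cortex-M4/M7 (CMSIS) intrinsic, so the function above won't compile on a desktop. For unit testing off-target, a hypothetical portable stand-in (my own, not part of CMSIS) can reproduce its behavior: the sum of absolute differences of four packed uint8_t lanes, accumulated into a running total.

```c
#include <stdint.h>

/* Portable stand-in for the CMSIS __USADA8 intrinsic: sum the absolute
   differences of the four uint8_t lanes of x and y, then add 'acc'. */
static uint32_t usada8_portable(uint32_t x, uint32_t y, uint32_t acc)
{
    for (int i = 0; i < 4; i++) {
        int a = (x >> (i * 8)) & 0xFF;
        int b = (y >> (i * 8)) & 0xFF;
        acc += (a > b) ? (a - b) : (b - a);
    }
    return acc;
}
```

As in the article's trick, passing y = 0 turns the instruction into a horizontal sum of 4 bytes.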

The optimized inner loop now becomes:

```c
for (int j = -ksize; j <= ksize; j++) {
    uint8_t *k_row_ptr = SRC_ROW_PTR(y + j);
    pixel = PIXEL(k_row_ptr, x - ksize - 1);
    data[pixel >> 2]--; // remove old pixels (left edge)
    pixel = PIXEL(k_row_ptr, x + ksize);
    data[pixel >> 2]++; // add new pixels (right edge)
} // for j
pixel = hist_median(data, 64, median_cutoff); // find the median
pixel <<= 2; // scale it back up to 8-bit range
```
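One detail the incremental loop hides is how the histogram gets primed: at the start of each row, the full kernel around the first processed column has to be accumulated once before the remove/add sweep can take over. A self-contained sketch of that step (the helper name and row-major image layout are my assumptions, not OpenMV's code):

```c
#include <stdint.h>

/* Prime a 64-bin histogram with the full kernel centered on pixel (0, y)
   of a row-major grayscale image. Indices are clamped to the image here
   for simplicity; the article instead splits the image into zones so the
   fast path can skip these per-pixel tests. */
static void seed_histogram(uint16_t hist[64], const uint8_t *img,
                           int w, int h, int y, int ksize)
{
    (void)w; /* w is used only for row addressing below */
    for (int j = -ksize; j <= ksize; j++) {
        int yy = y + j;
        if (yy < 0) yy = 0;
        if (yy > h - 1) yy = h - 1;
        for (int k = -ksize; k <= ksize; k++) {
            int xx = k; /* kernel centered on x = 0 */
            if (xx < 0) xx = 0;
            hist[img[yy * w + xx] >> 2]++; /* 6-bit bin */
        }
    }
}
```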

The part not visible above is the boundary checking. One way to manage it is to break the image into 5 zones (top, left, right, bottom, middle) and treat each one differently depending on whether its pixels need boundary tests. One of the constraints of working on embedded platforms is code size, so I chose to keep the code smaller: test once whether the current block needs any boundary checking at all, and split the code into 2 paths so the inner loop doesn't have to test every kernel x/y position.
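A minimal sketch of that dispatch test (my own helper, not OpenMV's code): decide once per pixel block whether the kernel can touch an image edge, and only then take the slower clamped path.

```c
#include <stdbool.h>

/* True if the kernel centered at (x, y) extends past any image edge.
   Interior pixels can then run a fast path with no per-sample boundary
   tests, while border pixels take a clamped path. */
static bool kernel_needs_clamping(int x, int y, int w, int h, int ksize)
{
    return (x - ksize < 0) || (y - ksize < 0) ||
           (x + ksize > w - 1) || (y + ksize > h - 1);
}
```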

### The results

The new algorithm with the same parameters as my original test now runs at 16.7 FPS, roughly a 9.3x speedup. This code is actually just a small portion of the imaging pipeline in the IDE, so the real speedup of the filter itself was even greater. One negative aspect of my changes is that the grayscale sample resolution has been reduced from 8 to 6 bits. This should have very little impact on the final use of the output, since the median filter is usually one step in a multi-step image processing pipeline such as edge detection.
