How to speed up your project with DMA

Intro

DMA (direct memory access) is a topic that's similar to pointers in C - it's not easy for everyone to visualize how it works. My goal for this blog post is to explain, in the simplest way possible, how it works and why your project can benefit from proper use of it.

What's it all about?

DMA is a useful feature for a CPU/MCU to have because it means that data can move around without the CPU (your code) having to do the work. In other words, DMA can move a block of data from memory-to-memory, peripheral-to-memory or memory-to-peripheral independently from the CPU. For people used to programming multi-core CPUs with a multi-threaded operating system, that may not sound very special. For those of us familiar with programming slow, low power, single threaded, embedded processors, it can make quite a difference. Here's a practical example - sending data to an SPI device (e.g. a small LCD display):

Without DMA

<prepare data1 - 10ms > <send data1 to SPI - 10ms > <prepare data2 - 10ms > <send data2 to SPI...>

With DMA

<prepare data1 - 10ms > <send data1 to SPI - 0ms > <prepare data2 - 10ms> <send data2 to SPI...>

Notice the difference in the "send data1" step? The CPU can start working on the next set of data without having to wait for the SPI transmission to complete. In this fictional scenario, the effective throughput of your device has doubled. Instead of 10ms to prepare plus 10ms to send, your machine can be preparing and sending data at the same time! Obviously there's no magic happening to make something take no time at all; what actually happens is that once you start the DMA going, your code can immediately get back to work. In the example above, if the data preparation were to take less time than the data transmission, the CPU would have to wait for the previous DMA transmission to complete before it could start the next (sending data2). Many systems have a way to chain DMA transactions together so that when one completes, the next will immediately start.
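To put numbers on the timeline above, here's a minimal sketch in plain C that models both cases. The function names are my own invention for illustration, and the model assumes only one DMA transfer can be in flight at a time (no chaining).

```c
#include <stdint.h>

/* Total time (ms) to prepare and send n blocks when the CPU waits for each send. */
uint32_t total_ms_blocking(uint32_t n, uint32_t prep_ms, uint32_t send_ms)
{
    return n * (prep_ms + send_ms);
}

/* Total time (ms) when each send runs in the background while the CPU
   prepares the next block. Only one transfer may be in flight at a time. */
uint32_t total_ms_with_dma(uint32_t n, uint32_t prep_ms, uint32_t send_ms)
{
    uint32_t cpu_ms = 0;      /* the CPU's current time */
    uint32_t dma_free_ms = 0; /* time at which the previous transfer finishes */

    for (uint32_t i = 0; i < n; i++) {
        cpu_ms += prep_ms;              /* prepare block i */
        if (cpu_ms < dma_free_ms) {
            cpu_ms = dma_free_ms;       /* wait for the previous transfer */
        }
        dma_free_ms = cpu_ms + send_ms; /* start the transfer and return at once */
    }
    return dma_free_ms; /* finished when the last transfer completes */
}
```

With 100 blocks at 10ms to prepare and 10ms to send, the blocking version takes 2000ms and the DMA version takes 1010ms. If preparing drops to 5ms while sending stays at 10ms, the CPU ends up waiting on the transfer and the send time sets the pace.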

Where is it used?

Systems with DMA hardware can usually point the source and destination to a lot of different peripherals and memory areas. Here are some other examples of where DMA can be useful:
- Moving data in and out of local (tightly coupled) vs external (DRAM) memory on DSPs
- Reading samples from a sensor (SPI/I2C/ADC) at a fixed rate
- Sending and receiving blocks of data to DAC/I2C/I2S/SPI/UART devices
- Creating a repeating output pattern (e.g. signal generator)

The focus of this article is the ESP32 with a SPI LCD attached, so from here onward, I'll only be discussing that use case.

A Common Pitfall

One of the most common points of failure for people new to DMA is accidental 'memory corruption'. This occurs when data is given to the DMA hardware to transmit and then the user's code returns to preparing new data and then oops! The output somehow gets corrupted. When the user disables DMA, everything works correctly - hmm... The problem in this case is that passing data to the DMA hardware implies that the data will be transmitted in the background. That block of data must not be modified before the DMA transaction completes, or you will be changing/corrupting the output before it gets sent. The fix is to manage multiple buffers so that you can keep working while old data is being sent. A common way of handling this is called "ping-pong" or double-buffering. The idea is that you work in one buffer, pass it to the DMA hardware, then switch to working on new data in another buffer. Each time you send the current data, you swap to working in the other buffer. This is usually the most practical way to leverage DMA without using tons of memory to queue future transactions.
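The pitfall is easy to reproduce on a PC with a pretend DMA engine. In this sketch, fake_dma_t and its functions are hypothetical stand-ins that I made up for illustration. Real hardware reads the buffer gradually during the transfer instead of all at once at the end, but the effect on a buffer that gets overwritten too early is the same.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 8

/* Hypothetical stand-in for DMA hardware: it only remembers a pointer when a
   transfer starts and reads the bytes later, when the transfer "completes". */
typedef struct {
    const uint8_t *src;      /* buffer handed over by the caller */
    uint8_t wire[BLOCK_LEN]; /* what actually reached the peripheral */
} fake_dma_t;

void fake_dma_start(fake_dma_t *dma, const uint8_t *buf)
{
    dma->src = buf; /* returns immediately, nothing has been sent yet */
}

void fake_dma_complete(fake_dma_t *dma)
{
    memcpy(dma->wire, dma->src, BLOCK_LEN); /* the data leaves the buffer now */
}

/* Sends a block of 'A' bytes, prepares a block of 'B' bytes while it is in
   flight, and returns 1 if the 'A' block arrived intact or 0 if it was corrupted. */
int send_a_then_prepare_b(int use_two_buffers)
{
    static uint8_t buf0[BLOCK_LEN], buf1[BLOCK_LEN];
    uint8_t expected[BLOCK_LEN];
    fake_dma_t dma;

    memset(expected, 'A', BLOCK_LEN);
    memset(buf0, 'A', BLOCK_LEN); /* prepare block A */
    fake_dma_start(&dma, buf0);   /* hand it to the "DMA" */

    /* Prepare block B while A is still in flight */
    uint8_t *next = use_two_buffers ? buf1 : buf0;
    memset(next, 'B', BLOCK_LEN);

    fake_dma_complete(&dma);
    return memcmp(dma.wire, expected, BLOCK_LEN) == 0;
}
```

Calling send_a_then_prepare_b(0) reuses the same buffer and returns 0 (the 'A' block went out as 'B' bytes). Calling send_a_then_prepare_b(1) prepares the next block in the second buffer and returns 1.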

Example Project

One of my more popular open source libraries is AnimatedGIF. It allows you to play GIF animations on MCUs with all types of displays. One of its primary design features is that it decodes one line at a time and passes it to the user code in a callback function called GIFDraw(). I designed it this way to allow large images to be decoded by MCUs with tiny internal memories. The MCU only needs to pass these pixels to the display (usually a SPI LCD with its own frame buffer). If transmitted with SPI DMA enabled, the GIF decoder can be decoding the next line while the current line is being sent to the display. Depending on how fast the CPU can decode each line, this can potentially make the SPI transmit time effectively 0. I've written an Arduino sketch to demonstrate how this works:

https://github.com/bitbank2/CYD_Projects/tree/main/gif_example

In the photo above, you can see it running on the "original" Cheap Yellow Display. The 240x240 "Hyperspace" GIF can animate (unthrottled) at 22 frames per second without DMA and 31 FPS with DMA enabled. The speed isn't doubled by enabling DMA, so this indicates that preparing the pixels takes more time than sending them to the LCD. The SPI transmit time has basically been eliminated from the animation loop. This is an ILI9341 240x320 LCD at a clock speed of 40MHz. A device with a slower LCD interface would benefit more from enabling DMA. For this example, double-buffering isn't needed because by the time the pixels are ready to pass to the GIFDraw callback function, the last line has already finished transmitting.
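The frame rates above can be turned into per-frame times with a little arithmetic. This is only a rough estimate, and it assumes that nothing else in the loop changed between the two runs. The helper names are made up for this example.

```c
/* Milliseconds spent on one frame at a given frame rate. */
double frame_ms(double fps)
{
    return 1000.0 / fps;
}

/* Milliseconds saved per frame when the rate rises from fps_before to fps_after. */
double saved_ms_per_frame(double fps_before, double fps_after)
{
    return frame_ms(fps_before) - frame_ms(fps_after);
}
```

At 22 FPS each frame takes about 45.5ms and at 31 FPS about 32.3ms, so roughly 13.2ms per frame is no longer spent waiting on the SPI bus. The remaining 32.3ms is larger than the 13.2ms that was saved, which fits with pixel preparation being the slower of the two steps.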

How do I use DMA in Arduino?

This is an excellent question. At the time of Arduino's inception, the MCU they chose (Atmel ATMega328) didn't include DMA hardware. Atmel's (and other vendors') newer MCUs do include DMA, but Arduino hasn't added support for it into their official API. I can't really blame Arduino for this omission; DMA hardware can vary greatly and making a simple API to access it isn't trivial. Adafruit made a separate DMA library for the ATSAM MCUs. For this project, I'm using an ESP32 MCU and I used the ESP-IDF functions to access SPI+DMA instead of the Arduino SPI class. The good news is that the complete ESP-IDF API is also present in the ESP32 Arduino board support, so this code can work inside and outside of the Arduino environment. It's also fortunate that Espressif's SPI (and DMA) API is easy to use and there are plenty of example projects. The ESP32 (depending on the model) can control 2 or more SPI buses on (mostly) any GPIO pins. Here's how to initialize it for our project:

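#include <assert.h>
#include <string.h>
#include "driver/spi_master.h" // ESP-IDF SPI master driver

spi_bus_config_t buscfg;
spi_device_interface_config_t devcfg;
esp_err_t ret;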

    memset(&buscfg, 0, sizeof(buscfg));
    buscfg.miso_io_num = -1; // for full duplex devices like uSD, we would need a valid GPIO
    buscfg.mosi_io_num = iMOSI;
    buscfg.sclk_io_num = iCLK;
    buscfg.max_transfer_sz= (320*24*2); // enough to send 1/10th of the display at once
    buscfg.quadwp_io_num=-1;
    buscfg.quadhd_io_num=-1;
    // Initialize the SPI bus and let it choose the first available DMA channel
    // The bus numbering varies by chip type, for the original ESP32, VSPI is the one we want 
    ret=spi_bus_initialize(VSPI_HOST, &buscfg, SPI_DMA_CH_AUTO);
    assert(ret == ESP_OK);

The code above tells the system to reserve a SPI hardware unit for the GPIO pins we want to use. Next, we specify the info to talk to our specific device (multiple instances can be created, each with a unique CS (chip select) pin):

    spi_device_handle_t spi; // this is a static global variable to use elsewhere
    memset(&devcfg, 0, sizeof(devcfg));
    devcfg.clock_speed_hz = iFreq;
    devcfg.mode = 0; // SPI mode 0
    devcfg.spics_io_num = iCSPin;          //CS pin
    devcfg.queue_size = 1; // one transaction at a time is enough
    devcfg.post_cb = spi_post_transfer_callback;
    devcfg.flags = SPI_DEVICE_HALFDUPLEX; // output-only device
    // Add this device to the bus and return the access handle in 'spi'
    ret=spi_bus_add_device(VSPI_HOST, &devcfg, &spi);
    assert(ret==ESP_OK);

Now we're ready to start using the SPI interface with DMA. There are two optional callback functions you can use - a before (pre) and after (post). I pass the pointer to a 'post' function above to know when the DMA transfer has completed. I use this info to set a volatile flag to know if I need to wait for a previous transfer to complete before requesting a new one:

static void IRAM_ATTR spi_post_transfer_callback(spi_transaction_t *t)
{
    spi_transfer_is_done = true;
}

The 'spi_transfer_is_done' variable is declared as 'volatile bool' so that it can be used as a simple semaphore. I wrote a wrapper function over the ESP32 API to write data with or without DMA. The reason you might want to write it without DMA is in the case of LCD initialization. A long sequence of commands and data must be sent to the LCD and there's no benefit to letting DMA send the data and then waiting for each transaction to complete before being able to send the next command. Here's my SendData() function:

    void SendData(uint8_t *pData, int iLen, bool bDMA)
    {
        // A queued transaction descriptor must stay valid until the transfer
        // completes, so it can't live on the stack
        static spi_transaction_t t;

        while (!spi_transfer_is_done) { // wait for previous DMA transfer to complete
            vTaskDelay(0);
        }
        memset(&t, 0, sizeof(t)); // safe to reuse now that the previous transfer is done
        t.tx_buffer = pData; // the data we're sending
        t.length = 8*iLen; // the length is specified in bits, not bytes
        if (bDMA) { // for DMA, we use a different function to 'queue' the transaction
             spi_transfer_is_done = false;
             spi_device_queue_trans(spi, &t, portMAX_DELAY);
        } else {
             spi_device_polling_transmit(spi, &t); // for non-DMA, we 'poll' AKA 'wait' for it to complete
        }
    } /* SendData() */

On many embedded systems, there's no API provided to use DMA; you must write directly to the memory-mapped hardware registers. Espressif has done a great job at defining an API which simplifies access to the hardware without limiting the features or performance.

Buffer Management

What isn't shown above is how to manage the ping-pong buffers (if needed). A simple arrangement can be done like this:

static uint8_t buf0[2048], buf1[2048];
static uint8_t *pDMA = buf0;

void MyDrawFunction()
{
    while (1) { // do this forever (why? because this is example code)
        uint8_t *d = pDMA; // point to the buffer we're free to draw into
        int iLen = sizeof(buf0); // number of bytes prepared (example value)
        // ... do some drawing into d[0] through d[iLen-1] ...
        SendData(d, iLen, true); // transmit with DMA
        if (pDMA == buf0) { // swap between the buffers
            pDMA = buf1;
        } else {
            pDMA = buf0;
        }
    } // while (1)
} /* MyDrawFunction() */

As you can see above, the idea is that once you pass the buffer to be sent by DMA, you swap the pointers and do your next task in the other buffer and then swap again when you send the next. You can also queue multiple DMA events to be executed in succession, but I haven't needed to use this feature, at least not for working with SPI LCDs.

What about other target systems?

One of the challenges of using DMA with Arduino is that it's not part of the core API. This means that you have to write different code for each MCU you want to support with your project. This is a smaller problem for most people because most projects target a single type of MCU. For me as a library author, it's a big problem. I have to write unique code for each different MCU. In my bb_spi_lcd library I support multiple MCUs with some creative #ifdefs, but the main 'brand' of MCU supported is the ESP32 from Espressif Systems. DMA can be abstracted with a single API, but it's a bit more exotic for a platform designed for new makers.
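To give a feel for the #ifdef approach, here's a stripped-down sketch. This is not the actual bb_spi_lcd code; HAS_SPI_DMA, lcd_write() and the two platform_write functions are hypothetical names, and the platform functions are empty stubs so that the example compiles anywhere. The ARDUINO_ARCH_ESP32 macro is defined by the ESP32 Arduino core.

```c
#include <stddef.h>
#include <stdint.h>

/* Decide at compile time whether this target has a DMA path.
   Other MCU families would get their own branch here. */
#if defined(ARDUINO_ARCH_ESP32)
#define HAS_SPI_DMA 1
#else
#define HAS_SPI_DMA 0
#endif

typedef enum { WRITE_BLOCKING = 0, WRITE_DMA = 1 } write_path_t;

/* Hypothetical stub: a real version would call the platform's blocking SPI write. */
static void platform_write_blocking(const uint8_t *data, size_t len)
{
    (void)data;
    (void)len;
}

#if HAS_SPI_DMA
/* Hypothetical stub: a real version would start a DMA transfer
   (e.g. something like the SendData() function above). */
static void platform_write_dma(const uint8_t *data, size_t len)
{
    (void)data;
    (void)len;
}
#endif

/* One entry point for the rest of the library. Returns the path that was used. */
write_path_t lcd_write(const uint8_t *data, size_t len, int want_dma)
{
#if HAS_SPI_DMA
    if (want_dma) {
        platform_write_dma(data, len);
        return WRITE_DMA;
    }
#else
    (void)want_dma; /* no DMA on this target, always fall back */
#endif
    platform_write_blocking(data, len);
    return WRITE_BLOCKING;
}
```

The rest of the library only ever calls lcd_write(). On a target without a DMA branch, a request for DMA quietly falls back to the blocking write, so the same sketch still runs, just slower.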
