ESP32-P4 SIMD Explained

Introduction

N.B. - This article is for programmers who are familiar with SIMD.
Like the ESP32-S3 before it, the ESP32-P4 includes SIMD instructions - Espressif calls them 'PIE' (processor instruction extensions). Before getting into the details of the P4, it's necessary to go over the history of the ESP32 family. The original ESP32, ESP32-S2 and ESP32-S3 all use Cadence's Xtensa LX CPUs. The release of the ESP32-C3 marked a turning point for Espressif with the use of RISC-V CPUs (no license fee). The ESP32-S3 is the last MCU in their lineup to use an Xtensa CPU. Espressif decided to add SIMD instructions (PIE) to the S3 to support more advanced imaging and machine learning tasks. The PIE instructions on the ESP32-S3 look a lot like the SIMD instruction sets on Cadence's other CPUs.

The ESP32-P4, however, has two 32-bit RISC-V CPUs inside. RISC-V is an open instruction set architecture that is unrelated to Cadence's Xtensa CPUs. RISC-V's instruction set has advanced rapidly over the last few years and the working group has ratified several powerful sets of SIMD instructions; even the lowest level of these (1.0) would be quite impressive to have on a low cost microcontroller, but...they are not what Espressif used in the ESP32-P4 😒. I assume the decision to not use the RISC-V Vector instructions was due to the amount of silicon it would require to be compliant with the rvv1.0 standard.

Instead, Espressif created a set of custom RISC-V instructions that closely match the ones used in the ESP32-S3. The result is an instruction set that will initially be a bit confusing to people familiar with writing SIMD code on the S3 - it combines RISC-V scalar instructions with ESP32-S3 style vector instructions. Luckily both instruction sets are load/store type and manage registers in a very similar way, so for someone used to writing SIMD code for the S3, transitioning to the P4 will be very easy.

Missing Documentation

At the time of this writing (April 6, 2026), Espressif hasn't yet released info about the ESP32-P4's PIE. There is a chapter reserved for it in their technical reference manual, but it's currently empty. There is a brief discussion here with a few details, but not quite enough to write your own code. I decided to fill in the missing pieces on my own. Last week I wrote some new ESP32-S3 SIMD code to accelerate GIF decoding. While it's still fresh in my mind, I decided to try porting this same code to the ESP32-P4. As a starting point, I needed a refresher on the RISC-V scalar register model and common instructions. 32-bit RISC-V (RV32) defines a programming model with 32 general purpose registers (x0-x31). These are referenced with a variety of names to simplify their use. Here's a compact table which summarizes it well:

Register   ABI name   Description                          Saved by
===================================================================
x0         zero       hard-wired zero                      -
x1         ra         return address                       caller
x2         sp         stack pointer                        callee
x3         gp         global pointer                       -
x4         tp         thread pointer                       -
x5-x7      t0-t2      temporaries                          caller
x8         s0/fp      saved register / frame pointer       callee
x9         s1         saved register                       callee
x10-x11    a0-a1      function arguments / return values   caller
x12-x17    a2-a7      function arguments                   caller
x18-x27    s2-s11     saved registers                      callee
x28-x31    t3-t6      temporaries                          caller


We need to memorize this table to write code on the ESP32-P4. We can see from the list that function parameters are passed in a0-a7, returned in a0-a1 and that we can use t0-t6 in our functions without having to save and restore them. The main vector registers are identical between the two CPUs - there are 8 128-bit registers named q0-q7. The wide accumulator registers are different, but I don't make use of them in my code (yet). Besides the calling convention and register name differences, the scalar instructions we'll need to use are also different. The following table shows the equivalent instructions I used to translate my S3 SIMD code to work on the P4 (not in any particular order). One instruction that I'm still missing is the 'hardware loop' which shaves some cycles off of fixed count loops.

ESP32-S3 scalars              ESP32-P4 scalar equivalent
========================================================
movi.n a0, imm                li a0, imm
addi.n a0, a0, imm            addi a0, a0, imm
bnez.n a0, label              bne a0, x0, label
retw.n                        ret
loopnez a0, bottom_label      beqz a0, bottom_label (at the top) +
                              addi a0, a0, -1 + j top_label (at the bottom)
s8i a0, a1, 0                 sb a0, 0(a1)
s16i a0, a1, 0                sh a0, 0(a1)
s32i a0, a1, 0                sw a0, 0(a1)

Those are the scalar instructions that needed to be changed between my S3 and P4 code. The rest of my code is composed of 128-bit SIMD instructions and for the ones I used, the only difference is the instruction name prefix - 'ee' for ESP32-S3 SIMD and 'esp' for ESP32-P4 SIMD.

Code Example

A few days ago I was thinking about additional ways of speeding up GIF decoding. I revisited the ESP32-S3 SIMD instruction set and found one that I had previously overlooked - ee.ldxq.32. This instruction is basically a partial gather. For those unfamiliar, SIMD gather instructions allow you to accelerate reads from scattered addresses by using one register as a set of memory offsets that point to data which will be read into another register. This is useful for translating data through a lookup table, normally an operation that would have to be done one element at a time. The gather instruction can't speed up the memory reads of those scattered addresses, but it can eliminate the extra time spent looping and packing the data into a new register. The ldxq instruction allows you to use any of the 8 16-bit lanes of a 128-bit SIMD register as an offset added to a base address register. The read is forced to be 32 bits wide and is packed into one of the 4 lanes of another 128-bit SIMD register. Here's how I use it to accelerate GIF palette lookups:

ee.xorq q1,q1,q1 # a register of 0's for zipping
ee.vzip.8 q0,q1 # stretch 8bit->16bit data
# first 8 pixels
ee.ldxq.32 q4,q0,a4,0,0 # load Q4:0 with first pixel's color
ee.ldxq.32 q4,q0,a4,1,1 # load Q4:1 with second pixel's color
ee.ldxq.32 q4,q0,a4,2,2
ee.ldxq.32 q4,q0,a4,3,3
ee.ldxq.32 q5,q0,a4,0,4
ee.ldxq.32 q5,q0,a4,1,5
ee.ldxq.32 q5,q0,a4,2,6
ee.ldxq.32 q5,q0,a4,3,7
ee.vunzip.16 q4,q5 # fold 32-bit values to 16-bits
ee.vst.128.ip q4,a5,16 # write 8 x RGB565 pixels
# second 8 pixels
ee.ldxq.32 q4,q1,a4,0,0
ee.ldxq.32 q4,q1,a4,1,1
ee.ldxq.32 q4,q1,a4,2,2
ee.ldxq.32 q4,q1,a4,3,3
ee.ldxq.32 q5,q1,a4,0,4
ee.ldxq.32 q5,q1,a4,1,5
ee.ldxq.32 q5,q1,a4,2,6
ee.ldxq.32 q5,q1,a4,3,7
ee.vunzip.16 q4,q5 # fold 32-bit values to 16-bits
ee.vst.128.ip q4,a5,16 # write 8 x RGB565 pixels
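
For readers who want something simpler to check the SIMD output against, the lookup can be modelled in plain C. This is only a minimal sketch, not the code AnimatedGIF actually uses: the function name palette_lookup_rgb565 and its parameters are hypothetical, and it assumes that each 32-bit palette entry carries its RGB565 color in the low 16 bits, with the upper 16 bits unused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the gather-based palette lookup.
 * Assumption: palette32 has 256 entries, each holding an RGB565 color
 * in its low 16 bits; the upper 16 bits are padding. */
void palette_lookup_rgb565(const uint8_t *indices, const uint32_t *palette32,
                           uint16_t *out, size_t count)
{
    assert(indices != NULL && palette32 != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        /* Widen the 8-bit pixel to an index and read one 32-bit entry,
         * which is what a single gather lane does. */
        uint32_t entry = palette32[indices[i]];
        /* Narrow the 32-bit entry to the 16-bit output pixel. */
        out[i] = (uint16_t)entry;
    }
}
```

Calling it with count set to 16 covers the same amount of work as the two groups of 8 pixels in the assembly, one element at a time instead of in packed registers.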

In the SIMD listing above, I start with 16 x 8-bit pixels in Q0. I expand them to 16 bits for use as offsets with ldxq. I then pack all of the resulting reads from A4 plus these offsets into two Q registers, unzip them to narrow the palette entries to 16 bits and then write them to the output pointer. The color palette entries need to be 32 bits apart to make this work - a small sacrifice of wasted space to get a significant speedup.

This code, along with using SIMD for the simple operation of merging transparent and opaque pixels (compare+mask), gives my AnimatedGIF decoder a 20-30% overall speedup. The speedup of the two optimized operations (transparent pixel handling and palette lookup) is on the order of 8x. This is primarily due to using branchless operations on the groups of 8 and 16 pixels the SIMD code can work on at a time. As mentioned above, the merging of transparent pixels is a much simpler operation that any SIMD instruction set can do efficiently:

ee.vld.128.ip q0,a0,16 # load 16 source pixels
ee.vld.128.ip q1,a1,0 # load 16 destination pixels
ee.vcmp.eq.s8 q2,q0,q3 # compare with transparent color
ee.andq q4,q2,q1 # keep destination pixels using the mask
ee.notq q2,q2 # invert the mask
ee.andq q0,q2,q0 # keep source opaque pixels
ee.orq q0,q0,q4 # combine src+dst pixels

The code above uses the 8-bit compare instruction (ee.vcmp.eq.s8) to generate a set of 16 masks (0xff = true, 0x00 = false) to combine the new pixels with the old; q3 is expected to already hold the transparent color index in all 16 lanes. The result is stored back to the destination pointer and then translated through the palette to create the final output.
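
The same compare+mask idea can be written in portable C, which may help when validating the SIMD path. The sketch below is an illustration only: merge_transparent and its parameter names are hypothetical, and it processes one pixel per iteration where the SIMD version handles 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the branchless transparent-pixel merge.
 * src holds the newly decoded 8-bit pixels, dst the previous frame. */
void merge_transparent(const uint8_t *src, uint8_t *dst, size_t count,
                       uint8_t transparent)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; i++) {
        /* 0xFF where the source pixel is transparent, 0x00 otherwise */
        uint8_t mask = (uint8_t)-(src[i] == transparent);
        /* keep dst where transparent, take src where opaque */
        dst[i] = (uint8_t)((dst[i] & mask) | (src[i] & (uint8_t)~mask));
    }
}
```

There is no if statement in the loop body; the selection is done entirely with AND, NOT and OR, mirroring the andq/notq/andq/orq sequence.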

Test Rig

The hardware I used for testing the optimized GIF code is a pair of Elecrow's latest CrowPanel HMIs (note - these are affiliate links that help support my work):

The ESP32-S3 version: https://tidd.ly/4ttKD3Q


The ESP32-P4 version: https://tidd.ly/4sTKpDf

Both have a 5" 800x480 RGB panel (16-bit parallel) IPS LCD which is driven from the ESP32's PSRAM. This seemed like a good matchup for comparing the performance of the ESP32-S3 vs the ESP32-P4 and for testing the benefit of the SIMD optimizations on both.

Prerequisites

Before working with either of these products, I like to add support for them to my bb_spi_lcd library. It makes it much easier to create test programs if I can initialize the LCD display with a single line of code. The challenge is in finding the correct info for each product since there are multiple versions of them. After some searching and trial and error, I was able to find the correct info to initialize the displays on both products. In bb_spi_lcd, I've added them as DISPLAY_ELECROW_P4_800x480 and DISPLAY_ELECROW_S3_800x480.

The Test Sketch

The Arduino IDE is what I usually use for testing projects like this. It allows me to create a small sketch in a few minutes and quickly test that same code on different target processors without having to change the code or project configuration. For this project, I'm using SIMD instructions to optimize my AnimatedGIF library. I created a small test sketch here. There are multiple steps to decode and display a GIF animation:
  1. Read compressed data from the GIF file
  2. Decompress the data into 8-bit palette entries
  3. Merge opaque and transparent pixels with the previous frame contents
  4. Convert the palette entries into pixels using the color palette
  5. Display the pixels on the LCD
The unthrottled GIF playback rate depends on how much time is spent in each of these 5 steps. For this project, the animated GIF file data is read from FLASH memory so little latency is added by step 1. Step 2 takes the most time of all 5 and can't be sped up with SIMD since the compressed data uses variable length codes. SIMD instructions aren't designed to accelerate that type of data. Step 5 depends heavily on the type of display used; for this exercise we'll write directly to the local framebuffer in PSRAM. Writing to PSRAM adds some latency, but much less than the other steps. So we're left with steps 3 and 4 to optimize with SIMD. This pair of steps usually represents about 1/3 of the total decode time, so in the best of circumstances we could only hope to reduce the decode time by 1/3. Here are the actual results measured on the two boards: 

ESP32-P4
w/o SIMD 44 frames of 320x213 pixels 8-bpp, 876ms = 50.2fps
w SIMD 44 frames in 759ms = 58fps

ESP32-S3
w/o SIMD 44 frames of 320x213 pixels 8-bpp, 1549ms = 28.4fps
w SIMD 44 frames in 1401ms = 31.4fps
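
The frame rates above follow directly from the frame count and the elapsed time, and the ceiling on the overall gain is an application of Amdahl's law. The small C sketch below shows both calculations; the function names frames_per_second and amdahl_speedup are mine, chosen for illustration, and are not part of any library.

```c
#include <assert.h>

/* Frames per second from a frame count and an elapsed time in ms. */
double frames_per_second(int frames, double elapsed_ms)
{
    assert(frames >= 0 && elapsed_ms > 0.0);
    return frames * 1000.0 / elapsed_ms;
}

/* Amdahl's law: overall speedup when `fraction` of the run time
 * (0.0 to 1.0) is made `factor` times faster and the rest is unchanged. */
double amdahl_speedup(double fraction, double factor)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

Taking the rough figures quoted in this article (about 1/3 of the decode time, sped up on the order of 8x), amdahl_speedup(1.0 / 3.0, 8.0) gives roughly 1.41x. The measured ratios are 876/759, or about 1.15x, on the P4 and 1549/1401, or about 1.11x, on the S3, so the gains on this test file sit below that ceiling.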

Conclusion

GIF decoding was a good challenge for SIMD optimization - I haven't seen any other open source libraries attempt it. I assume the reason is that most PCs and mobile devices don't currently have a CPU which supports some kind of gather instruction, so there wouldn't be any real benefit to trying to optimize GIF playback with SIMD. Luckily the ESP32 S3 and P4 have an instruction which shaves precious cycles off of the palette lookup step. It was a fun challenge to find a new way to improve GIF decoding speed. The speed improvement isn't tremendous, but it is significant. The SIMD function itself runs many times faster than the C code it replaces, but as shown earlier, this can't have a huge impact on the overall playback speed. I hope this article provides the necessary info to help you port your ESP32-S3 SIMD code to the ESP32-P4.

What do you think? Feel free to leave your comments below.
