ESP32-P4 SIMD Explained
Introduction
N.B. - This article is for programmers who are familiar with SIMD.
Missing Documentation
At the time of this writing (April 6, 2026), Espressif hasn't yet released info about the ESP32-P4's PIE. There is a chapter reserved for it in their technical reference manual, but it's currently empty. There is a brief discussion here with a few details, but not quite enough to write your own code. I decided to fill in the missing pieces on my own. Last week I wrote some new ESP32-S3 SIMD code to accelerate GIF decoding. While it's still fresh in my mind, I decided to try porting this same code to the ESP32-P4.
As a starting point, I needed a refresher on the RISC-V scalar register model and common instructions. 32-bit RISC-V (RV32) defines a programming model with 32 general purpose registers (x0-x31). These are referenced with a variety of names to simplify their use. Here's a compact table which summarizes it well:
We need to memorize this table to write code on the ESP32-P4. We can see from the list that function parameters are passed in a0-a7, values are returned in a0-a1, and that we can use t0-t6 in our functions without having to save and restore them. The main vector registers are identical between the ESP32-S3 and the ESP32-P4: there are eight 128-bit registers named q0-q7. The wide accumulator registers are different, but I don't make use of them in my code (yet).
Besides the calling convention and register name differences, the scalar instructions we'll need to use are also different. The following table shows the equivalent instructions I used to translate my S3 SIMD code to work on the P4 (not in any particular order). One instruction that I'm still missing is the 'hardware loop', which shaves some cycles off of fixed count loops.
ESP32-S3 scalars              ESP32-P4 scalar equivalent
===========================================================================
movi.n  a0, imm               li   a0, imm
addi.n  a0, a0, imm           addi a0, a0, imm
bnez.n  a0, label             bne  a0, x0, label
retw.n                        ret
loopnez a0, bottom_label      beqz a0, bottom_label (at the top) +
                              addi a0, a0, -1 + j top_label (at the bottom)
s8i     a0, a1, 0             sb   a0, 0(a1)
s16i    a0, a1, 0             sh   a0, 0(a1)
s32i    a0, a1, 0             sw   a0, 0(a1)
Those are the scalar instructions that needed to be changed between my S3 and P4 code. The rest of my code is composed of 128-bit SIMD instructions and for the ones I used, the only difference is the instruction name prefix - 'ee' for ESP32-S3 SIMD and 'esp' for ESP32-P4 SIMD.
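The loopnez replacement from the table is easier to follow with its control flow spelled out. Here is a minimal C sketch that mirrors the pattern (test at the top, decrement and unconditional jump at the bottom) using goto. The function name count_iterations and the variable body_runs are made up for illustration, and the comments show which RISC-V instruction each line stands in for. Under the standard RISC-V calling convention the count n would arrive in a0.

```c
#include <stdint.h>

/* Hypothetical helper: runs a loop body n times using the same shape as
 * the RISC-V replacement for loopnez, and returns how often the body ran. */
static uint32_t count_iterations(uint32_t n)
{
    uint32_t body_runs = 0;

top_label:
    if (n == 0) goto bottom_label;   /* beqz a0, bottom_label */
    body_runs++;                     /* the loop body goes here */
    n--;                             /* addi a0, a0, -1 */
    goto top_label;                  /* j top_label */

bottom_label:
    return body_runs;
}
```

Like loopnez, this shape skips the body entirely when the count is zero. The price compared to a hardware loop is the extra instructions executed on every pass.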
Code Example
A few days ago I was thinking about additional ways of speeding up GIF decoding. I revisited the ESP32-S3 SIMD instruction set and found one that I had previously overlooked - ee.ldxq.32. This instruction is basically a partial gather. For those unfamiliar, SIMD gather instructions allow you to accelerate reads from diverse addresses by using one register as a set of memory offsets that point to data which will be read into another register. This is useful for translating data through a lookup table, normally an operation that would have to be done one element at a time. The use of the gather instruction can't speed up the memory reads of those diverse addresses, but it can eliminate the extra time spent looping and packing the data into a new register.
The ldxq instruction allows you to use each of the eight 16-bit lanes of a 128-bit SIMD register as an offset added to a base address register. The read is forced to be 32 bits wide and is packed into one of the 4 lanes of another 128-bit SIMD register. Here's how I use it to accelerate GIF palette lookups:
ee.xorq q1,q1,q1 # a register of 0's for zipping
ee.vzip.8 q0,q1 # stretch 8bit->16bit data
# first 8 pixels
ee.ldxq.32 q4,q0,a4,0,0 # load Q4:0 with first pixel's color
ee.ldxq.32 q4,q0,a4,1,1 # load Q4:1 with second pixel's color
ee.ldxq.32 q4,q0,a4,2,2
ee.ldxq.32 q4,q0,a4,3,3
ee.ldxq.32 q5,q0,a4,0,4
ee.ldxq.32 q5,q0,a4,1,5
ee.ldxq.32 q5,q0,a4,2,6
ee.ldxq.32 q5,q0,a4,3,7
ee.vunzip.16 q4,q5 # fold 32-bit values to 16-bits
ee.vst.128.ip q4,a5,16 # write 8 x RGB565 pixels
# second 8 pixels
ee.ldxq.32 q4,q1,a4,0,0
ee.ldxq.32 q4,q1,a4,1,1
ee.ldxq.32 q4,q1,a4,2,2
ee.ldxq.32 q4,q1,a4,3,3
ee.ldxq.32 q5,q1,a4,0,4
ee.ldxq.32 q5,q1,a4,1,5
ee.ldxq.32 q5,q1,a4,2,6
ee.ldxq.32 q5,q1,a4,3,7
ee.vunzip.16 q4,q5 # fold 32-bit values to 16-bits
ee.vst.128.ip q4,a5,16 # write 8 x RGB565 pixels
In the code above, I start with 16 x 8-bit pixels in Q0. I expand them to 16 bits for use as offsets with ldxq. I then pack all of the resulting reads from A4 plus these offsets into two Q registers, unzip them to narrow the palette entries to 16 bits and then write them to the output pointer. The color palette entries need to be 32 bits apart to make this work. A small sacrifice of wasted space to get a significant speedup.
This code, along with using SIMD for the simple operation of merging transparent and opaque pixels (compare+mask), gives my AnimatedGIF decoder a 20-30% overall speedup. The speedup of the two optimized operations (transparent pixel handling and palette lookup) is on the order of 8x. This is primarily due to using branchless operations for the groups of 8 and 16 pixels the SIMD code can work on at a time.
As mentioned above, the merging of transparent pixels is a much simpler operation that any SIMD instruction set can do efficiently:
ee.vld.128.ip q1,a1,0 # load 16 destination pixels
ee.vcmp.eq.s8 q2,q0,q3 # compare with transparent color
ee.andq q4,q2,q1 # keep destination pixels using the mask
ee.notq q2,q2 # invert the mask
ee.andq q0,q2,q0 # keep source opaque pixels
ee.orq q0,q0,q4 # combine src+dst pixels
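For readers who want to check SIMD output against something simpler, here is a scalar C sketch of the two operations described above: the transparent pixel merge and the palette lookup through 32-bit-wide palette slots. It is a reference model under stated assumptions, not the AnimatedGIF source. The function names (merge_transparent, palette_lookup), the parameter names, and the assumption that the RGB565 color sits in the low 16 bits of each 32-bit palette slot are all mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the destination pixel wherever the source pixel equals the
 * transparent index, otherwise keep the source pixel. No per-pixel
 * branch is needed, in the spirit of the compare + and/not/or sequence.
 * The result overwrites src, as the SIMD code overwrites q0. */
static void merge_transparent(uint8_t *src, const uint8_t *dst,
                              uint8_t transparent, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* 0xFF where the source pixel is transparent, 0x00 elsewhere */
        uint8_t mask = (uint8_t)-(src[i] == transparent);
        src[i] = (uint8_t)((dst[i] & mask) | (src[i] & (uint8_t)~mask));
    }
}

/* Translate 8-bit palette indices into 16-bit pixels. Each palette entry
 * occupies one 32-bit slot; the color is assumed to be in the low 16 bits. */
static void palette_lookup(const uint8_t *indices, const uint32_t *palette32,
                           uint16_t *out, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t entry = palette32[indices[i]]; /* 32-bit read per pixel */
        out[i] = (uint16_t)entry;               /* narrow to 16 bits */
    }
}
```

Calling merge_transparent on a line of decoded indices and then palette_lookup on the result follows the same order as the decoder steps: merge first, then convert to colors.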
Test Rig
Prerequisites
The Test Sketch
- Read compressed data from the GIF file
- Decompress the data into 8-bit palette entries
- Merge opaque and transparent pixels with the previous frame contents
- Convert the palette entries into pixels using the color palette
- Display the pixels on the LCD

