Surprise! ESP32-S3 has (a few) SIMD instructions

Intro

Espressif Systems released their ESP32-S3 SoC a few years ago, but only recently have they released more documentation and support for its full capabilities. Without any changes to your code, the S3 runs about 15% faster than older ESP32 CPUs at the same clock speed. It also has a 'hidden' capability, a small set of SIMD instructions, that's more difficult to use but can be worth the effort if you need more speed. This article is aimed at programmers who are already familiar with SIMD instructions on other platforms.

I've been optimizing code with SIMD for more than 15 years on Intel, Arm and DSPs (even Cadence's), so when I heard that the S3 had SIMD instructions, I immediately went searching for documentation. When the S3 became available to buy, there was only a promise of documentation and support. In the 2+ years since then, not much has changed. At the end of 2023, Espressif released a document describing the new instructions:

S3 Technical Reference Manual

The document has a decent level of detail and only a few errors, but what's still missing are examples and documentation on how to use the instructions in your own code - conspicuously absent are directions for using the assembler and linking the result to your C/C++ code. This isn't entirely the fault of Espressif. The Xtensa processor comes from Cadence, and for some reason they like to keep everything under NDA, even information which would help people use their processors. I find it hard to understand why the instruction set should be kept secret; a CPU vendor should make it as easy as possible for engineers to use their CPUs. The 'trade secrets' are in the hardware design, not in the instruction set. I've worked with Cadence's DSPs before, so I'm familiar with their way of doing things. Their Vision DSPs have a robust and powerful instruction set. Unfortunately, the S3 has a very minimal set of SIMD instructions, probably due to cost and silicon area limits.

Since the SIMD 'Processor Extension' is treated as a coprocessor, its instructions must be mixed with instructions from the main instruction set in your code. Here's the document for the main LX7 instructions:

Xtensa ISA

The programmer model consists of 16 general purpose/address registers (a0-a15), 8 128-bit wide SIMD registers (q0-q7), and two special accumulator registers for multiply/accumulate operations. The memory bus is documented as 128 bits wide, so it is definitely advantageous to read from and write to memory at the native width. There are also some instructions to manipulate GPIO bits.

How I got started with S3 SIMD

I spent the better part of a day searching and experimenting with these instructions until I got working code. I started with a search on GitHub for any public repos containing the one instruction needed for any S3 SIMD project - load (ee.vld.128). A few hits popped up in Espressif's esp-dsp project. A lot of their ESP32-S3 code is closed source, but a few functions pulled back the veil on how to use these instructions in my own projects. This is the code I used as a starting point:

https://github.com/espressif/esp-dsp/blob/master/modules/fft/fixed/dsps_fft2r_sc16_aes3.S

I tried putting multiple functions into a single .S file, but that doesn't seem to work, so every function gets its own file. My first use case for these instructions is to optimize the color conversion step of my JPEG decoder. The YCbCr->RGB step takes a significant amount of time and is a good fit for SIMD optimization.
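For readers who haven't written this conversion before, here is a minimal scalar sketch in C of the YCbCr->RGB565 step that a SIMD version would replace. It is not the JPEGDEC code: the function names are hypothetical, and the constants are the usual JFIF conversion factors scaled by 4096, which is an assumption about the fixed-point format.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of an 8-bit channel. */
static uint8_t clamp_u8(int32_t v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Hypothetical scalar reference: one YCbCr pixel in, one RGB565 pixel out.
 * Coefficients are the JFIF factors (1.402, 0.344136, 0.714136, 1.772)
 * multiplied by 4096 and rounded. */
static uint16_t ycbcr_to_rgb565(uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t cbd = (int32_t)cb - 128;   /* chroma is stored with a +128 bias */
    int32_t crd = (int32_t)cr - 128;

    int32_t r = y + (5743 * crd) / 4096;
    int32_t g = y - (1410 * cbd + 2925 * crd) / 4096;
    int32_t b = y + (7258 * cbd) / 4096;

    uint8_t r8 = clamp_u8(r);
    uint8_t g8 = clamp_u8(g);
    uint8_t b8 = clamp_u8(b);

    /* Pack 5 bits of red, 6 of green, 5 of blue. */
    return (uint16_t)(((r8 >> 3) << 11) | ((g8 >> 2) << 5) | (b8 >> 3));
}
```

Every pixel runs the same multiplies, adds, clamps and shifts with no data-dependent branching beyond the clamp, which is why the step maps well onto vector lanes.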

The Instruction Set

I've written SIMD code for pixel color conversion many times on multiple SIMD instruction sets, and what struck me with the S3 was how little I had to work with. One of the main sticking points is that even though the instruction encodings are 24 bits each, there are no bits reserved for a shift amount. There are explicit shift instructions, and even they don't have the shift amount encoded. The multiply instructions can shift right after the multiply, but the shift amount must first be loaded into the SAR (shift amount register). This requires 2 additional instructions (and potentially additional pipeline cycles). The other big obstacles are that the instructions are not orthogonal across data sizes and that a lot of what's needed for efficient code is missing. For example, the shift left and right instructions only operate on 32-bit values and only do arithmetic shifting (the sign bit is carried), not logical. To generate RGB565 output, I need to shift 16-bit values. My workaround for a right shift was to multiply by 1 with the SAR set to the shift amount; for a left shift, I multiply by a power of 2 with the SAR set to 0.
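To make the multiply-as-shift workaround concrete, here is a scalar C model of it. This is only a sketch: `model_vmul_s16` is a hypothetical stand-in for a 16-bit vector multiply that shifts each product right by the SAR value and keeps the low 16 bits. That behavior is an assumption for illustration, and none of this is the actual assembly.

```c
#include <stdint.h>

#define LANES 8  /* eight 16-bit lanes fit in one 128-bit register */

/* Portable arithmetic shift right; C leaves >> on negative values
 * implementation-defined, so negative inputs go through ~. */
static int32_t asr32(int32_t x, unsigned n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Hypothetical model: per-lane signed multiply, shift right by sar,
 * keep the low 16 bits of the result. */
static void model_vmul_s16(int16_t *dst, const int16_t *a,
                           const int16_t *b, unsigned sar)
{
    for (int i = 0; i < LANES; i++) {
        int32_t product = (int32_t)a[i] * (int32_t)b[i];
        dst[i] = (int16_t)(uint16_t)asr32(product, sar);
    }
}

/* Right shift: multiply every lane by 1 and let SAR do the shifting. */
static void shift_right_s16(int16_t *dst, const int16_t *src, unsigned n)
{
    int16_t ones[LANES];
    for (int i = 0; i < LANES; i++)
        ones[i] = 1;
    model_vmul_s16(dst, src, ones, n);
}

/* Left shift: multiply every lane by 2^n with SAR set to 0 (n in 0..14). */
static void shift_left_s16(int16_t *dst, const int16_t *src, unsigned n)
{
    int16_t pow2[LANES];
    for (int i = 0; i < LANES; i++)
        pow2[i] = (int16_t)(1 << n);
    model_vmul_s16(dst, src, pow2, 0);
}
```

For pixel data in the 0..255 range the arithmetic right shift gives the same result as a logical one, so the missing logical shift doesn't matter for this particular use.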

Here's a short list of what I consider essential SIMD features that are missing on the S3:

- Shift right logical
- Shift 8 or 16-bit values
- Add or multiply with widening
- Right shift with narrowing
- Add and subtract without saturation
- Unaligned reads and writes
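As an illustration of the saturation item in the list above, this short C sketch contrasts a saturating 16-bit add with a wrapping one. Both helpers are hypothetical scalar models written for this comparison, not S3 instructions.

```c
#include <stdint.h>

/* Saturating add: results outside the int16_t range are clamped. */
static int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Wrapping add: only the low 16 bits of the sum are kept. */
static int16_t add_wrap_s16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}
```

With inputs 30000 and 10000, the saturating version returns 32767 while the wrapping version returns -25536. When only saturating adds are available, any algorithm that relies on modular arithmetic needs a different approach.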

Also missing are these nice-to-haves that other Cadence DSPs have:

- Scatter/gather writes and reads
- Horizontal vector operations
- Rearrange vector element order
- Floating point support
- Predicated operations (operate on select elements only)

The gotcha that had me going in circles for a while is the memory alignment restriction. In my JPEG decoder's internal data structure, some of the elements are aligned on 4-byte boundaries. S3 SIMD load and store instructions can at best access memory on 8-byte boundaries, but ideally want everything on 16-byte boundaries. I fixed this by inserting a "long double" in front of the items that need 16-byte alignment 😀.
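A more explicit way to request the same alignment, if your compiler supports C11, is `alignas`. The sketch below uses a hypothetical `decoder_state_t` with made-up field names; it is not the JPEGDEC structure, just an illustration of the technique.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state; the fields are illustrative only. */
typedef struct {
    int32_t width;
    int32_t height;
    uint8_t flags;
    /* Buffers touched by 128-bit loads/stores are forced to 16-byte offsets. */
    alignas(16) int16_t mcu_buffer[64];
    alignas(16) uint16_t out_pixels[128];
} decoder_state_t;

/* Returns nonzero when the pointer sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

The member alignment only helps if the structure itself starts on a 16-byte boundary. Static and stack instances get that from the compiler, but a heap allocator may not guarantee it, so a runtime check like `is_aligned16` is cheap insurance before calling into the SIMD code.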

Where to go from here...

I'm going to add some optimized functions to my various imaging libraries where appropriate and see where it takes me. I did an initial test with my JPEGDEC library and saw a nearly 40% speedup (14ms -> 10ms) by using the S3 SIMD instructions for the color conversion step. I'll publish this code after I've had time to fully flesh it out and test it. Good luck with your use of S3 SIMD...

