ESP32-S3 SIMD Minimal Example

In my recent blog post, I described the ESP32-S3 SIMD instructions and how I figured out how to use them. This post provides the simplest working example to serve as a starting point for your own projects. The code was written for the Arduino IDE, but will work similarly with Espressif's ESP-IDF. To add assembly language files (.S) to your Arduino project, just place them in the same folder as the .ino file. Below are the two files needed for this project, followed by the serial terminal output you should see when you run it. Simply create an empty project and copy/paste this code into it:

The Arduino sketch

//
// ESP32-S3 minimal SIMD example
// Written by Larry Bank
// Copyright (c) 2024 BitBank Software, Inc.
//
// The purpose of this example is to show how to make use of ESP32-S3 SIMD instructions
// in your Arduino or ESP-IDF projects. The code is not comprehensive and just provides
// a starting point for someone wanting to learn how to use them. I wrote this because
// I couldn't find such an example and thought that people would appreciate saving some
// time with the research I did.
//

// The vector add used here (ee.vadds.s16) saturates its results, so notice what
// happens to value 7 in the output: 0x7fff + 0x4000 is clamped to 0x7fff

extern "C" {
  int s3_add16x8(int16_t *pA, int16_t *pB, int16_t *pC);
}
// 128-bit (16-byte) loads and stores need to be 16-byte aligned
int16_t __attribute__((aligned (16))) u16_A[8] = {0x00, -0x100, 0x00, 0x1111, 0x00, 0x1234, 0x00, 0x7fff};
int16_t __attribute__((aligned (16))) u16_B[8] = {0x00, 0x3000, 0x00, 0x2222, 0x00, 0x4321, 0x00, 0x4000};
int16_t __attribute__((aligned (16))) u16_C[8] = {0};

void setup() {
  Serial.begin(115200);
  delay(3000); // wait for USB-CDC to start
  Serial.println("About to call Asm code");
  s3_add16x8(u16_A, u16_B, u16_C);
  Serial.println("Returned from Asm code");
  for (int i=0; i<8; i++) {
    // cast to uint16_t so a negative result prints as 4 hex digits instead of being sign-extended
    Serial.printf("value %d = 0x%04x\n", i, (uint16_t)u16_C[i]);
  }
} /* setup() */

void loop() {
} /* loop() */
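
The comment in the sketch notes that the 128-bit loads and stores need 16-byte aligned buffers. As a minimal sketch of how that assumption could be checked before calling the assembly routine, the helper below tests the low four address bits. The names is_aligned16, require_aligned16 and demo_buf are hypothetical and not part of the original project; alignas(16) is simply the C++11 spelling of the same alignment request the sketch makes with __attribute__((aligned (16))).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the original sketch): true when the pointer
// sits on a 16-byte boundary, i.e. its low 4 address bits are all zero.
static inline bool is_aligned16(const void *p) {
    return (reinterpret_cast<uintptr_t>(p) & 0xF) == 0;
}

// Hypothetical debug guard: stops in a debug build if a buffer that is about
// to be handed to a 128-bit load/store routine is not 16-byte aligned.
static inline void require_aligned16(const void *p) {
    assert(is_aligned16(p));
}

// Same alignment request as the sketch's arrays, written with C++11 alignas.
alignas(16) static int16_t demo_buf[8] = {0};
```

In the sketch, a call such as require_aligned16(u16_A) just before s3_add16x8() would be one place to use it; this is most useful when buffers come from malloc() or from offsets into a larger array rather than from aligned globals.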

The s3_simd.S file

//
// ESP32-S3 SIMD example
// Written by Larry Bank
// Copyright (c) 2024 BitBank Software, Inc.
//
#include "dsps_fft2r_platform.h"
#if (dsps_fft2r_sc16_aes3_enabled == 1)
    .text
    .align 4
// Simple signed 16-bit x 8 add
// registers with the args: A2 A3 A4
// Call as int s3_add16x8(int16_t *pA, int16_t *pB, int16_t *pC);
    .global s3_add16x8
    .type   s3_add16x8,@function
s3_add16x8:
    entry   a1,16                # prepare windowed registers and reserve 16 bytes of stack
    ee.vld.128.ip   q0,a2,16     # load 8 "A" values into Q0 from A2, then add 16 to A2
    ee.vld.128.ip   q1,a3,16     # load 8 "B" values into Q1 from A3, then add 16 to A3
    ee.vadds.s16    q2,q0,q1     # C = A+B (with saturation)
    ee.vst.128.ip   q2,a4,16     # store the 8 "C" values, then add 16 to A4
    movi.n  a2,0                 # return value of 0
    retw.n                       # restore state (windowed registers) and return to caller
#endif // dsps_fft2r_sc16_aes3_enabled

The Serial Terminal Output

About to call Asm code
Returned from Asm code
value 0 = 0x0000
value 1 = 0x2f00
value 2 = 0x0000
value 3 = 0x3333
value 4 = 0x0000
value 5 = 0x5555
value 6 = 0x0000
value 7 = 0x7fff
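
To see why value 7 comes out as 0x7fff rather than wrapping around, here is a scalar reference for what the saturating add computes on each of the 8 lanes. This is only a sketch for comparison and is not part of the original project: the names sat_add_s16 and ref_add16x8 are hypothetical, and the code assumes the usual meaning of signed saturation (clamping each sum to the int16_t range).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference (not from the original post): add two signed
// 16-bit values in 32-bit precision, then clamp the sum to the int16_t range.
static inline int16_t sat_add_s16(int16_t a, int16_t b) {
    int32_t sum = static_cast<int32_t>(a) + static_cast<int32_t>(b);
    if (sum > INT16_MAX) return INT16_MAX;   // clamp positive overflow to 0x7fff
    if (sum < INT16_MIN) return INT16_MIN;   // clamp negative overflow to -0x8000
    return static_cast<int16_t>(sum);
}

// Same argument order as s3_add16x8(): C[i] = saturate(A[i] + B[i]) for 8 lanes.
static void ref_add16x8(const int16_t *pA, const int16_t *pB, int16_t *pC) {
    assert(pA && pB && pC);
    for (int i = 0; i < 8; i++) {
        pC[i] = sat_add_s16(pA[i], pB[i]);
    }
}
```

Running ref_add16x8() on the same u16_A and u16_B arrays and comparing against u16_C is a simple way to check the assembly version; for lane 7, 0x7fff + 0x4000 would be 0xbfff in 32 bits, which is above INT16_MAX and is therefore clamped to 0x7fff.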

Comments

  1. Great post, Larry! I've been working on similar ESP32-S3 SIMD optimizations for neural networks. Your examples are incredibly helpful and align perfectly with challenges I've encountered.

    I just published a blog about implementing QSiLUApprox activation functions using SIMD on ESP32-S3, tackling some unique challenges:
    - Creating custom macros for int8↔int16 conversions with proper sign extension
    - Implementing vector compression techniques
    - Developing SIMD-friendly absolute value computation

    Your example will be useful as I continue optimizing this implementation. If you're interested in neural network optimizations for ESP32, I'd love to get your thoughts on my approach: https://boumedinebillal.blogspot.com/2025/03/esp32-p4-deep-learning-pipeline-update.html

    Thanks for sharing your knowledge in this area!

    Replies
    1. I'm glad you found it useful. I'm not doing any ML projects with the ESP32 at the moment. My focus is on speeding up pixels.
