I put my ESP32-S3 dev board from PCBWay through a quick performance workout by decoding a baked-in animated GIF with Larry Bank’s decoder and tweaking ESP-IDF settings. Cranking the CPU to 240MHz gave the expected ~1.5× bump, -Os beat -O2, switching flash from DIO to QIO shaved a bit more, and turning the caches up to 11 pushed it further. Best combo: 240MHz, -Os, QIO, max caches (with a larger partition and watchdog off). Nice little speed win.
The ESP32-S3 is a pretty amazing CPU.
It’s a dual core Xtensa LX7 processor.
I’m using my dev board that I got made by PCBWay.
It came out pretty well.
There’s a module on here.
It’s pretty normal.
There’s nothing special about it.
It does have eight megabytes of flash and eight megabytes of PSRAM.
I’ve been doing some projects recently where I really need to squeeze
the most bang for my buck out of the CPU.
I thought I’d try out some settings on the ESP-IDF
and see what gives us the most power out of the CPU.
I set up a project here.
It’s a very simple project.
All I’m doing is decoding a GIF.
I’ve baked this animated GIF into the flash
and I’m using Larry Bank’s excellent AnimatedGIF decoder.
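For anyone recreating this: baking a file into the app in ESP-IDF is done from the component’s CMakeLists.txt. A minimal sketch (the filenames here are just placeholders for whatever source file and GIF you use):

```cmake
# main/CMakeLists.txt
idf_component_register(
    SRCS "main.cpp"
    INCLUDE_DIRS "."
    # EMBED_FILES bakes the binary into the app image;
    # the data becomes accessible via generated linker symbols.
    EMBED_FILES "test.gif"
)
```

The embedded data then shows up in the code via the generated `_binary_test_gif_start` and `_binary_test_gif_end` symbols, which is what gets handed to the GIF decoder.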
So at the moment my configuration has been reset to the out-of-the-box defaults
you get with the Hello World application.
There’s just a few things I need to do to make sure our code actually runs.
The first thing is our animated GIF is actually quite big.
I need to set the partition table so it actually fits.
We’ll use the large single app partition table, and I also need to turn off the watchdog timer
because we’re doing some performance code
so I don’t want to actually trigger the watchdog timer and kill my application.
So I’ll just turn that off.
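In sdkconfig terms, those two changes look roughly like this (option names as in recent ESP-IDF; the watchdog option in particular has been renamed between IDF versions, so check yours):

```ini
# Large single factory app partition table, so the embedded GIF fits
CONFIG_PARTITION_TABLE_SINGLE_APP_LARGE=y

# Don't start the task watchdog, so the long-running decode loop
# isn't killed mid-benchmark
# CONFIG_ESP_TASK_WDT_INIT is not set
```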
If we save this and build and flash, we get a nice baseline measurement
of how fast our code actually runs.
That’s our test run.
We’ve got our initial baseline value so let’s just record that.
So the total time is about 1.4 seconds.
Look at the milliseconds.
That’s our baseline.
Now obviously the first thing you can do on the ESP32
is you can control the CPU clock speed.
An obvious thing to change is just switch to 240 megahertz instead of the default 160.
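That’s a one-line change in sdkconfig (set via menuconfig; names from ESP-IDF’s Kconfig, and the menu location varies slightly between IDF versions):

```ini
# Run the CPU at 240 MHz instead of the default 160 MHz
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ=240
```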
So let’s quickly run that and see how that works out.
Okay so we’ve got the results.
Let’s copy that into our spreadsheet.
240 megahertz and we should expect around a 1.5 times improvement which is what we see here.
Now there’s more options we can try so if we look at the compiler options
then we have our optimization level.
So currently we’re just compiling with the default debug settings (-Og).
Now there’s two interesting options here.
We have optimize for size (-Os) and optimize for performance (-O2).
Now I had endless debates in my first job around which was best.
There is a strong argument that optimize for size is the best thing to choose.
Firstly we’re trying to fit our program into a small amount of flash.
Obviously my board has 8 megabytes of flash, which is quite a lot, but most modules have around 2 megabytes.
So you might want to optimize for size just so you can fit your program on the actual device.
The other reason for choosing optimize for size is that many CPUs have instruction caches.
If you make your code smaller it’s more likely to fit in the cache and you’ll get fewer cache misses.
Let’s optimize for size first and see how well that works.
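For reference, the optimization level maps to mutually exclusive sdkconfig options (only one should be set; names from ESP-IDF’s Kconfig):

```ini
# Optimize for size (-Os)
CONFIG_COMPILER_OPTIMIZATION_SIZE=y
# The alternatives would be:
# CONFIG_COMPILER_OPTIMIZATION_DEFAULT=y   (debug, -Og)
# CONFIG_COMPILER_OPTIMIZATION_PERF=y      (performance, -O2)
```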
There we go another improvement.
We’re even faster.
Let’s try out the optimize for speed and see if that does something else.
Optimize for performance.
That’s pretty interesting.
Our time’s actually gone back up.
Maybe the arguments for optimize for size are actually quite valid.
So let’s switch back to optimize for size.
That seems to give us the best result.
Now the other interesting thing is that we are reading quite a lot of data from flash.
We have an embedded GIF and it’s quite large.
Now an interesting setting we can try changing is the flash SPI mode.
So by default this runs on DIO.
You can change this to QIO.
Now this is interesting because it controls not only how fast the chip is flashed,
but also how fast code and data are read from it, since QIO uses four data lines where DIO uses two.
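The flash mode is set via the esptool options in sdkconfig; switching from DIO to QIO looks like this (it only works if the flash chip and board wiring support quad I/O):

```ini
# Quad I/O flash mode instead of the default dual I/O
CONFIG_ESPTOOLPY_FLASHMODE_QIO=y
```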
So if we save this let’s see what effect it has.
So it’s actually had a measurable effect.
We’ve shaved off about 1.2 percent from our time.
That’s pretty impressive.
Now there are some even more esoteric options.
So if we look down at the CPU section.
If I can find it.
So there’s all of these cache configuration options.
So we can bump up the instruction cache size, and its associativity, which defaults to eight ways.
We can also bump the data cache size up to 64 kilobytes
and the data cache line size up to 64 bytes.
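These menuconfig entries correspond to ESP32-S3-specific sdkconfig options along these lines (names from recent ESP-IDF; worth verifying against your IDF version):

```ini
# Instruction cache: 32 KB, 8-way, 32-byte lines
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
CONFIG_ESP32S3_ICACHE_ASSOCIATED_WAYS_8=y
CONFIG_ESP32S3_INSTRUCTION_CACHE_LINE_32B=y
# Data cache: 64 KB, 64-byte lines
CONFIG_ESP32S3_DATA_CACHE_64KB=y
CONFIG_ESP32S3_DATA_CACHE_LINE_64B=y
```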
So let’s see what effect these settings have.
So we’ve turned it up to 11.
Very special, because if you look, the numbers all go to 11.
Look right across the board: 11, 11.
Most of the amps only go up to 10.
And it’s actually improved the performance even further.
So that’s pretty cool.
Let’s just fix my spelling of number.
Now I’m kind of intrigued: if we now switch back to -O2
with our new cache sizes,
does it improve things?
So let’s try that.
Go back to compiler options.
Optimize for performance.
So in theory, bigger cache sizes should mean we get the benefits
of optimize for performance without needing to optimize for size,
because our code should now fit in the larger caches.
But let’s see what actually happens.
So that’s very interesting.
I mean, it has improved our -O2 value.
Before it was 962 milliseconds.
It’s now 933 milliseconds.
But it’s still not as good as -Os plus all the other changes and the bigger caches.
Which seems to be our winner.
So that’s pretty interesting.
So I think that’s our best combination of things.
So optimize for size.
Turn the caches up to maximum.
Obviously 240 megahertz.
And switch on QIO if you can.
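Put together, the winning combination as an sdkconfig fragment would be something like this (option names as discussed, so double-check them against your ESP-IDF version):

```ini
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
CONFIG_COMPILER_OPTIMIZATION_SIZE=y
CONFIG_ESPTOOLPY_FLASHMODE_QIO=y
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
CONFIG_ESP32S3_DATA_CACHE_64KB=y
CONFIG_ESP32S3_DATA_CACHE_LINE_64B=y
```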
So pretty interesting.
I think this is our number one winner,
and I’ll use these settings for my performance-critical code from now on.
Thanks for watching.
Well I hope you found it interesting.
It was a bit code heavy but interesting for me.