I turned a vanilla ESP32-S3 dev board into a USB UVC webcam that doesn’t use a camera at all—first streaming a static test card, then an animated GIF, and finally a real-time Pong game. The ESP32 pre-decodes GIF frames to RGB, JPEG-encodes them, and streams MJPEG, and for the live game it renders to a framebuffer, JPEG-encodes in ~23 ms, and just about hits 30 fps. There’s room to optimize (dual-core draw/encode), and this approach is great for dashboards, sensor visualizations, or testing video pipelines. Shout out to PCBWay for the boards—they turned out great.
This is a webcam, but it’s not really a
camera. It’s my ESP32,
and it’s running Pong. So, how does this
even work? Well, to my computer, this
ESP32 devboard is just a totally normal
webcam.
So, this is one of my ESP32-S3 dev
boards. There’s nothing special about
it. It is just a run-of-the-mill dev
board. A quick shout out to PCBWay as they fabricated these boards for me.
They handled the PCBs and I did the
assembly and the boards came out really
nicely. There’s a link to PCBWay in the
description if you want to check them
out. Now, the nice thing about the S3 and other Espressif ICs is that they have native USB support. This means that they can behave like USB devices, and they can even act as USB hosts, so you can connect USB devices to them. But we’re using it in USB device mode. Now,
once it’s been set up to say, “I’m a
webcam,” the computer doesn’t care what
it actually is. It just sees a webcam.
And as long as the ESP32 sends a stream
of frames over USB, it just works. So,
in this video, we’ve got three demos.
First, we have a fairly boring test
card. Then, we have an animated GIF. And
finally, we’ve got a real-time game. So,
let’s start off simple.
So, this is possibly the most boring
webcam in the world. We’re using the
Espressif USB UVC device component,
which lets the ESP32 act as a standard
USB webcam. We’re sending the video as
motion JPEG or MJPEG. Now, MJPEG is
literally just a stream of JPEG images
sent one after another. There’s nothing
very clever about it. In this example,
I’ve baked a single JPEG into the
firmware. And every time our computer
asks for a frame, we just send the same
JPEG data over and over again. As far as
the computer is concerned, it’s getting
a video stream, even though we’re just
sending the same image repeatedly. It
doesn’t really care. It just sees a
stream.
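For reference, the whole of demo one fits in a short sketch. This is based on Espressif’s usb_device_uvc component from esp-iot-solution; the exact struct fields and function names can vary between component versions, and testcard.jpg is a hypothetical file embedded with EMBED_FILES in the project’s CMakeLists.txt:

```cpp
#include "usb_device_uvc.h"

#define FRAME_WIDTH  320
#define FRAME_HEIGHT 240

// ESP-IDF exposes files embedded with EMBED_FILES via start/end symbols.
extern const uint8_t testcard_jpg_start[] asm("_binary_testcard_jpg_start");
extern const uint8_t testcard_jpg_end[]   asm("_binary_testcard_jpg_end");

static uint8_t s_uvc_buffer[32 * 1024];  // transfer buffer for the UVC driver
static uvc_fb_t s_fb;

// Called every time the host wants a frame: hand back the same baked-in JPEG.
static uvc_fb_t *on_fb_get(void *ctx) {
    s_fb.buf    = (uint8_t *)testcard_jpg_start;
    s_fb.len    = testcard_jpg_end - testcard_jpg_start;
    s_fb.width  = FRAME_WIDTH;
    s_fb.height = FRAME_HEIGHT;
    s_fb.format = UVC_FORMAT_JPEG;  // MJPEG: just JPEGs, one after another
    return &s_fb;
}

static void on_fb_return(uvc_fb_t *fb, void *ctx) { /* nothing to free */ }

extern "C" void app_main(void) {
    uvc_device_config_t config = {
        .uvc_buffer      = s_uvc_buffer,
        .uvc_buffer_size = sizeof(s_uvc_buffer),
        .fb_get_cb       = on_fb_get,
        .fb_return_cb    = on_fb_return,
        // stream start/stop callbacks omitted for brevity
    };
    uvc_device_config(0, &config);  // the descriptor tells the host "I'm a webcam"
    uvc_device_init();              // from here on, the host just sees a camera
}
```

The host negotiates MJPEG at 320x240 and fb_get gets called once per frame; everything else in this video is really just changing what that callback returns.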
So, demo two. A static image is fine.
But webcams are meant to move. For this
demo, I baked an animated GIF into the
firmware. When the ESP32 boots up, I
decode each frame into an RGB frame
buffer. I’m using Larry Bank’s fantastic AnimatedGIF decoder library for this. We
then encode that frame buffer into a
JPEG image. We do all this processing up
front and then just stream the
pre-encoded JPEG frames. The GIF file
already contains timing for the frames,
so all we need to do is keep track of
the time and switch new frames in at the
right moment. I’ve kept the resolution
pretty sensible at 320x240 pixels.
There’s a really nice online tool called
EZGIF that you can use to resize and optimize GIFs to get them down to a
sensible size. Typically on most
modules, you might have 2 megabytes of
flash to play with, so you need to get
them quite small. Doing all the heavy
lifting up front, decoding the GIF and
encoding the JPEGs is pretty nice. We
are trading memory for performance,
which makes streaming fast and simple.
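In code, the boot-time pass looks something like this. It’s a sketch using Larry Bank’s AnimatedGIF library; encode_jpeg() and store_encoded_frame() are hypothetical helpers standing in for whichever JPEG encoder and frame store you use, and GIF transparency/disposal handling is skipped for brevity:

```cpp
#include <AnimatedGIF.h>

// Hypothetical helpers (not part of any library): encode a frame to JPEG,
// and stash the encoded frame together with its display delay.
extern uint8_t *encode_jpeg(const uint16_t *rgb565, int w, int h);
extern void store_encoded_frame(uint8_t *jpeg, int delay_ms);

static AnimatedGIF gif;
static uint16_t s_frame[FRAME_WIDTH * FRAME_HEIGHT];  // RGB565 working buffer

// AnimatedGIF delivers one scanline of 8-bit palette indices at a time;
// map them through the RGB565 palette into the framebuffer.
static void gif_draw(GIFDRAW *pDraw) {
    uint16_t *dst = &s_frame[(pDraw->iY + pDraw->y) * FRAME_WIDTH + pDraw->iX];
    for (int x = 0; x < pDraw->iWidth; x++) {
        dst[x] = pDraw->pPalette[pDraw->pPixels[x]];
    }
}

// Decode every frame once at boot, JPEG-encode it, and stash the result
// together with the GIF's own per-frame delay.
void pre_encode_gif(const uint8_t *data, int size) {
    gif.begin(GIF_PALETTE_RGB565_LE);  // ask for a little-endian RGB565 palette
    if (!gif.open((uint8_t *)data, size, gif_draw)) return;
    int rc, delay_ms;
    do {
        rc = gif.playFrame(false, &delay_ms);  // decodes one frame via gif_draw
        if (rc >= 0) {
            store_encoded_frame(encode_jpeg(s_frame, FRAME_WIDTH, FRAME_HEIGHT),
                                delay_ms);
        }
    } while (rc > 0);  // rc drops to 0 once the final frame has been decoded
    gif.close();
}
```

At stream time, the frame callback just keeps a running clock against those stored delays and returns whichever pre-encoded JPEG should be showing.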
But what if we want to generate content
on the fly? Can we do it at a decent
frame rate, say 30 frames per second? Looking
at the logs from the encoding and
decoding, we can see the JPEG encoding
on average takes just under 23
milliseconds. Now, at 30 frames per second, you only have about 33 milliseconds of total budget per frame (1000 ms / 30 ≈ 33.3 ms). So, the JPEG encoding is going
to take a huge bite out of that time.
This means whatever we’re doing has to
be simple and fast, which does make Pong
a pretty good test.
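If you want to see that for yourself, esp_timer_get_time() gives you a microsecond clock, so bracketing the encoder with two reads is enough (encode_jpeg() is the same hypothetical helper as before):

```cpp
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "timing";

void log_encode_time(void) {
    // Budget at 30 fps: 1000 ms / 30 ≈ 33.3 ms per frame, total.
    int64_t t0 = esp_timer_get_time();
    encode_jpeg(s_frame, FRAME_WIDTH, FRAME_HEIGHT);  // ~23 ms on average here
    int64_t t1 = esp_timer_get_time();
    // With ~23 ms gone on encoding, only ~10 ms remain for the game logic,
    // drawing, and the USB transfer.
    ESP_LOGI(TAG, "JPEG encode took %lld us", (long long)(t1 - t0));
}
```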
So, here we have Pong running entirely
on the ESP32 and being displayed on our
screen. We run a very simple loop. We
update the game state, render that into
a frame buffer, then encode the frame
buffer as a JPEG, and send it over USB.
I have baked some stats into the game
display and we are just about getting 30
frames per second. The E stat is the
average JPEG encode time and the D stat
is the game update and drawing time.
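Put together, the whole game is one frame callback. Again a sketch: pong_update() and pong_draw() are hypothetical stand-ins for the game logic and renderer, and encode_jpeg_to() for a JPEG encoder that writes into a caller-supplied buffer and returns the byte count:

```cpp
// Hypothetical helpers: advance the game, render it, encode the framebuffer.
extern void pong_update(void);
extern void pong_draw(uint16_t *fb);
extern size_t encode_jpeg_to(const uint16_t *rgb565, int w, int h,
                             uint8_t *out, size_t out_size);

static uint8_t s_jpeg[48 * 1024];  // output buffer for the encoded frame

static uvc_fb_t *on_fb_get(void *ctx) {
    int64_t t0 = esp_timer_get_time();
    pong_update();                      // move the ball, paddles, score
    pong_draw(s_frame);                 // render into the RGB565 framebuffer
    int64_t t1 = esp_timer_get_time();  // t1 - t0 is the "D" stat

    s_fb.len = encode_jpeg_to(s_frame, FRAME_WIDTH, FRAME_HEIGHT,
                              s_jpeg, sizeof(s_jpeg));
    int64_t t2 = esp_timer_get_time();  // t2 - t1 is the "E" stat
    (void)t2;  // in the demo, the running averages get drawn onto the next frame

    s_fb.buf    = s_jpeg;
    s_fb.width  = FRAME_WIDTH;
    s_fb.height = FRAME_HEIGHT;
    s_fb.format = UVC_FORMAT_JPEG;
    return &s_fb;                       // the UVC driver ships it over USB
}
```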
There are definitely ways to make this
more efficient. It is a dual-core chip.
So you could draw on one core while the
other core handles JPEG encoding.
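Here’s a sketch of that split, with a FreeRTOS task pinned to each core and two framebuffers cycling between a “free” and a “ready” queue (helpers as before; note that two 320x240 RGB565 buffers are about 300 KB, so this wants PSRAM or smaller frames):

```cpp
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"

static uint16_t s_fbufs[2][FRAME_WIDTH * FRAME_HEIGHT];  // ping-pong buffers
static QueueHandle_t s_free;   // buffers free to draw into
static QueueHandle_t s_ready;  // buffers waiting to be encoded

static void draw_task(void *arg) {    // pinned to core 0
    for (;;) {
        uint16_t *fb;
        xQueueReceive(s_free, &fb, portMAX_DELAY);
        pong_update();
        pong_draw(fb);
        xQueueSend(s_ready, &fb, portMAX_DELAY);
    }
}

static void encode_task(void *arg) {  // pinned to core 1
    for (;;) {
        uint16_t *fb;
        xQueueReceive(s_ready, &fb, portMAX_DELAY);
        encode_jpeg_to(fb, FRAME_WIDTH, FRAME_HEIGHT, s_jpeg, sizeof(s_jpeg));
        // ...hand the finished JPEG to the UVC frame callback here...
        xQueueSend(s_free, &fb, portMAX_DELAY);  // recycle the framebuffer
    }
}

void start_pipeline(void) {
    s_free  = xQueueCreate(2, sizeof(uint16_t *));
    s_ready = xQueueCreate(2, sizeof(uint16_t *));
    for (int i = 0; i < 2; i++) {
        uint16_t *fb = s_fbufs[i];
        xQueueSend(s_free, &fb, 0);   // both buffers start out free
    }
    xTaskCreatePinnedToCore(draw_task,   "draw",   4096, NULL, 5, NULL, 0);
    xTaskCreatePinnedToCore(encode_task, "encode", 4096, NULL, 5, NULL, 1);
}
```

That way the ~10 ms of drawing and the ~23 ms of encoding overlap instead of adding up.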
That’s just one of many possible optimizations. And there are plenty of other directions to take this. I’ve just done a very simple
game, but you could use this for
dashboards, visualizing sensor data, or
even testing video pipelines. So, if
you’re doing simple dashboards, you
probably don’t need 30 frames per
second. You could probably update it once a second, so the JPEG encoding
wouldn’t really matter that much. But,
there are many things you could do. So, if you’ve got any ideas, put them in the comments and we’ll try them out.