Introduction
In this tutorial we will learn how to detect faces using the ESP32 and a camera. We will be using the Arduino core to program the microcontroller.
I’ll be using an HW-818 camera board, which contains all the electronics needed to connect the camera (we just need to plug it into an onboard connector), a USB connector to program the ESP32 and 4 MB of external PSRAM. At the time of writing, this camera board can be bought on eBay for around 10 euros (link here). I’m using an OV2640 camera model, which shipped together with my board.
Although the pin definitions used in the code below are specific to the camera board I’m using, the rest of the code should be generic for other models.
Note that the code shown below is the minimum needed to detect faces in the images from the camera, so we can start understanding the basics before moving to more complex tutorials. So, it won’t show us the captured images or draw rectangles around the faces, but rather just print a simple message to the serial monitor when a face is detected in a captured image.
Nonetheless, it is possible to build such advanced applications, as demonstrated in the Arduino example for the camera. If you haven’t done it yet, my recommendation is to check that example and run it first, to make sure your hardware is working correctly before moving to custom code.
Also, the code shown below was put together in part by analyzing the implementation of the Arduino core example, which is a very good source of information for those learning how to work with the ESP32 and a camera.
Other important sources of documentation used for this tutorial were the following:
- The face detection ESP32 component (link here).
- ESP-WHO – the face detection and recognition platform from Espressif (link here).
- The ESP32 camera driver (link here).
Although we will be using the Arduino core, at the time of writing there is no higher level wrapper for the camera image capture and face detection. So we will be interacting with some lower level libraries.
The detection of faces provided by these libraries is based on the MTMN model [1].
If you prefer a video version of this tutorial, please check my YouTube channel below.
The code
We will start our code with the library includes. We will need esp_camera.h, which will allow us to initialize the camera and interact with it, and fd_forward.h, which exposes the function to detect faces in the image.
#include "esp_camera.h"
#include "fd_forward.h"
After that we are going to add some defines that correspond to the numbers of the ESP32 pins connected to the camera. Since I’m using an HW-818 model, I can use the same pin definitions as the AI THINKER model. If your board model is different, you should adapt these defines. You can find the defines for the most common models in this file from the Arduino core camera example.
#define PWDN_GPIO_NUM 32
#define RESET_GPIO_NUM -1
#define XCLK_GPIO_NUM 0
#define SIOD_GPIO_NUM 26
#define SIOC_GPIO_NUM 27
#define Y9_GPIO_NUM 35
#define Y8_GPIO_NUM 34
#define Y7_GPIO_NUM 39
#define Y6_GPIO_NUM 36
#define Y5_GPIO_NUM 21
#define Y4_GPIO_NUM 19
#define Y3_GPIO_NUM 18
#define Y2_GPIO_NUM 5
#define VSYNC_GPIO_NUM 25
#define HREF_GPIO_NUM 23
#define PCLK_GPIO_NUM 22
We will also define a function to take care of the camera initialization. We will then call this function in the Arduino setup. The initialization procedure was covered in more detail in this previous tutorial.
In short, our initialization function will be called initCamera, take no arguments and return a Boolean indicating if the procedure was successful or not. In its implementation we will declare a variable of type camera_config_t, which is a struct that will hold the initialization parameters for the camera.
Among the multiple initialization parameters, we will set the frame size to QVGA, which is the recommended resolution [1] for the face detection.
The complete function can be seen below.
bool initCamera() {
  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sscb_sda = SIOD_GPIO_NUM;
  config.pin_sscb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  config.frame_size = FRAMESIZE_QVGA;
  config.jpeg_quality = 10;
  config.fb_count = 1;
  esp_err_t result = esp_camera_init(&config);
  if (result != ESP_OK) {
    return false;
  }
  return true;
}
We will also declare a variable of type mtmn_config_t, which is a struct that will hold the configurations of MTMN that will be used to detect the faces [1]. As we will see below, for this tutorial we will be using the default configurations. Nonetheless, these can be fine tuned, as can be seen here.
mtmn_config_t mtmn_config = {0};
Additionally, we will define a variable that will count how many times we have detected faces. We will initialize it with the value zero and increment it every time faces are detected in a frame.
int detections = 0;
Moving on to the Arduino setup, we start by opening a serial connection, so we can output a message when we detect faces in the captured image. After that, we will call the initCamera function and check the returned value to confirm the camera was initialized properly.
Serial.begin(115200);
if (!initCamera()) {
  Serial.printf("Failed to initialize camera...");
  return;
}
After that we will call the mtmn_init_config function, which takes no arguments and returns a set of default MTMN configurations we can use right away to start detecting faces in the camera images.
Naturally, if it makes sense for your application, you can instead set your own configurations, as already mentioned.
mtmn_config = mtmn_init_config();
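As a rough illustration of what setting our own configurations could look like, the sketch below fills a stand-in struct on a regular computer. The types ThresholdSketch and MtmnConfigSketch and the helper makeCustomConfig are hypothetical, and both the field names and the values are assumptions based on what the Arduino core camera example appears to use, so they should be checked against fd_forward.h before being applied to the real mtmn_config variable.

```cpp
// Host-side stand-ins mirroring the ASSUMED layout of mtmn_config_t.
// On the ESP32, the real types from fd_forward.h would be used instead.
struct ThresholdSketch {
  float score;           // confidence threshold for keeping a candidate
  float nms;             // non-maximum suppression threshold
  int candidate_number;  // maximum number of candidate boxes kept
};

struct MtmnConfigSketch {
  float min_face;               // smallest face size to look for, in pixels
  float pyramid;                // scale factor between image pyramid levels
  int pyramid_times;            // number of pyramid resizing steps
  ThresholdSketch p_threshold;  // first (proposal) stage
  ThresholdSketch r_threshold;  // second (refine) stage
  ThresholdSketch o_threshold;  // third (output) stage
};

// Builds a configuration with illustrative values (assumptions, not
// recommendations). Stricter scores should give fewer false detections,
// at the cost of missing some faces.
MtmnConfigSketch makeCustomConfig() {
  MtmnConfigSketch cfg = {};
  cfg.min_face = 80.0f;
  cfg.pyramid = 0.707f;
  cfg.pyramid_times = 4;
  cfg.p_threshold = {0.6f, 0.7f, 20};
  cfg.r_threshold = {0.7f, 0.7f, 10};
  cfg.o_threshold = {0.7f, 0.7f, 1};
  return cfg;
}
```

On the board, the equivalent would be a series of assignments such as mtmn_config.min_face = 80, done in the setup instead of (or after) the call to mtmn_init_config.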
The complete setup function can be seen below.
void setup() {
  Serial.begin(115200);
  if (!initCamera()) {
    Serial.printf("Failed to initialize camera...");
    return;
  }
  mtmn_config = mtmn_init_config();
}
We will write the rest of our code in the Arduino main loop. There, we will continuously get an image, check for faces and print a message to the serial port in case we found any. Note that we are not doing any error checking below to keep the code simple, but for a real application scenario you should verify the result of each operation and act accordingly.
The first thing we need to do is to obtain a camera frame. So, we start by declaring a variable that will hold a pointer to a struct of type camera_fb_t.
As seen in the previous tutorial, this struct will hold a pointer to the buffer containing the actual image and also some metadata such as the width and the height of the image and the length of the buffer that contains it.
camera_fb_t * frame;
Then we will call the esp_camera_fb_get function to get an image from the camera. This function takes no arguments and returns a pointer to a camera_fb_t struct, which we will store in our previously declared variable.
frame = esp_camera_fb_get();
Before we start looking for faces in the image, we first need to take into consideration that the function we will use for that works with a struct of type dl_matrix3du_t (you can check the fields it contains here). So, we will need to convert the captured image to this format.
The first thing we will do is to allocate memory for the dl_matrix3du_t struct that will hold the image to be processed. We do so with a call to the dl_matrix3du_alloc function. It receives the following parameters [3]:
- Number of matrices. Should be 1 for our use case.
- Width of the matrix. Should be equal to the width of our image.
- Height of the matrix. Should be equal to the height of our image.
- Number of channels. Should be 3, since we will be working with an RGB image.
As output, this function will return a pointer to the allocated matrix struct.
dl_matrix3du_t *image_matrix = dl_matrix3du_alloc(1, frame->width, frame->height, 3);
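To get a feeling for how much memory this allocation asks for, the arithmetic can be sketched on a regular computer. The helper name rgb888BufferSize is hypothetical, and the 320x240 dimensions assume the standard QVGA resolution we configured in the camera initialization.

```cpp
#include <cstddef>

// Bytes needed for an uncompressed image with the given dimensions,
// assuming one byte per channel (as in RGB888, where channels = 3).
constexpr std::size_t rgb888BufferSize(std::size_t width, std::size_t height,
                                       std::size_t channels = 3) {
  return width * height * channels;
}
```

For a QVGA frame, rgb888BufferSize(320, 240) gives 230400 bytes (225 KiB). This is much more than the JPEG frame we get from the camera typically occupies, and it is a significant amount of memory for a microcontroller, which helps explain why a board with external PSRAM is convenient.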
In addition to working with this struct type, the detection function also expects the image to be in the RGB888 format [4]. So, we will call the fmt2rgb888 function, which will convert our original image (in JPEG) to the RGB888 format.
This function receives the following parameters [5]:
- Pointer to the source buffer. It can be in JPEG, RGB565, YUYV or GRAYSCALE formats, since the function is able to work with all of them. In our case, we have our original image in the JPEG format. We can obtain the original image buffer from the camera_fb_t struct, in the buf field.
- Length of the source buffer, in bytes. We can also obtain it from the previously mentioned struct, in the len field.
- Format of the source image. Although we know it is JPEG, we can also obtain it from the camera_fb_t struct, in the format field.
- Pointer to the destination buffer. It should be the item field from our dl_matrix3du_t struct.
fmt2rgb888(frame->buf, frame->len, frame->format, image_matrix->item);
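After the conversion, the destination buffer holds raw pixel bytes. The sketch below shows how a pixel could be located in such a buffer, assuming the channels of each pixel are stored next to each other and the rows follow one another without padding. That layout is an assumption about the matrix and pixelOffset is a hypothetical helper, so it is worth confirming against the library headers before relying on it.

```cpp
#include <cstddef>

// Offset, in bytes, of channel `c` of the pixel at column `x` and row `y`,
// for an interleaved, row-major image with `channels` bytes per pixel and
// no padding between rows (ASSUMED layout).
constexpr std::size_t pixelOffset(std::size_t x, std::size_t y, std::size_t c,
                                  std::size_t width,
                                  std::size_t channels = 3) {
  return (y * width + x) * channels + c;
}
```

Under this assumption, the last byte of a 320x240 image sits at pixelOffset(319, 239, 2, 320), which is 230399, one less than the total buffer size in bytes.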
Since we already have the image in the final buffer, we can call the esp_camera_fb_return function, passing as input the pointer to the camera_fb_t struct we originally obtained. This function call will allow the image buffer to be reused, which makes sense since we will continuously grab new images and we don’t need to keep the old ones.
esp_camera_fb_return(frame);
Then, we are finally going to call the face_detect function. This function receives as first input the pointer to the dl_matrix3du_t matrix containing the image in RGB888 format and as second input the address of the MTMN configuration to be used.
As output, it will return a pointer to a struct of type box_array_t, which contains the boxes inside which the face (or faces) were detected. In case no face was detected, the function will return NULL.
box_array_t *boxes = face_detect(image_matrix, &mtmn_config);
For this simple tutorial we just want to know if faces were detected or not in the image, so we are not going to look into the details of the returned struct. We are simply going to check if this pointer is different from NULL and, if it is, increase the detection counter and print its value to the serial port.
In case boxes were found, we need to free some of the fields contained in the box_array_t struct. However, we should use the dl_lib_free function rather than a regular free.
if (boxes != NULL) {
  detections = detections+1;
  Serial.printf("Faces detected %d times \n", detections);
  dl_lib_free(boxes->score);
  dl_lib_free(boxes->box);
  dl_lib_free(boxes->landmark);
  dl_lib_free(boxes);
}
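Although this tutorial does not look into the returned struct, the number of faces and their positions could be read from it before its fields are freed. The sketch below runs on a regular computer and uses simplified stand-in types (BoxSketch, BoxArraySketch) that mirror how the Arduino core camera example appears to read the real box_array_t: a len field with the number of boxes and, for each box, four values holding the top left and bottom right corners. These names and this layout are assumptions for illustration, and FaceRect and toRects are hypothetical helpers.

```cpp
#include <vector>

// Simplified host-side stand-ins for the library types (ASSUMED layout):
// each box stores its corners as {x0, y0, x1, y1}.
struct BoxSketch {
  float box_p[4];
};

struct BoxArraySketch {
  BoxSketch *box;  // array with the detected boxes
  int len;         // number of detected boxes
};

// Rectangle expressed as top left corner plus width and height.
struct FaceRect {
  int x, y, w, h;
};

// Converts every detected box to a rectangle. A null pointer means that
// no face was detected, so an empty list is returned.
std::vector<FaceRect> toRects(const BoxArraySketch *boxes) {
  std::vector<FaceRect> rects;
  if (boxes == nullptr) {
    return rects;
  }
  for (int i = 0; i < boxes->len; i++) {
    const float *p = boxes->box[i].box_p;
    int x = (int)p[0];
    int y = (int)p[1];
    int w = (int)p[2] - x + 1;  // corners are inclusive, hence the +1
    int h = (int)p[3] - y + 1;
    rects.push_back({x, y, w, h});
  }
  return rects;
}
```

With this approach, the size of the returned list would tell us how many faces were found in the frame, and each rectangle could be printed to the serial port alongside the detection counter.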
We also need to free our matrix with a call to the dl_matrix3du_free function.
dl_matrix3du_free(image_matrix);
The main loop can be found below.
void loop() {
  camera_fb_t * frame;
  frame = esp_camera_fb_get();
  dl_matrix3du_t *image_matrix = dl_matrix3du_alloc(1, frame->width, frame->height, 3);
  fmt2rgb888(frame->buf, frame->len, frame->format, image_matrix->item);
  esp_camera_fb_return(frame);
  box_array_t *boxes = face_detect(image_matrix, &mtmn_config);
  if (boxes != NULL) {
    detections = detections+1;
    Serial.printf("Faces detected %d times \n", detections);
    dl_lib_free(boxes->score);
    dl_lib_free(boxes->box);
    dl_lib_free(boxes->landmark);
    dl_lib_free(boxes);
  }
  dl_matrix3du_free(image_matrix);
}
The complete code can be seen below.
#include "esp_camera.h"
#include "fd_forward.h"
#define PWDN_GPIO_NUM 32
#define RESET_GPIO_NUM -1
#define XCLK_GPIO_NUM 0
#define SIOD_GPIO_NUM 26
#define SIOC_GPIO_NUM 27
#define Y9_GPIO_NUM 35
#define Y8_GPIO_NUM 34
#define Y7_GPIO_NUM 39
#define Y6_GPIO_NUM 36
#define Y5_GPIO_NUM 21
#define Y4_GPIO_NUM 19
#define Y3_GPIO_NUM 18
#define Y2_GPIO_NUM 5
#define VSYNC_GPIO_NUM 25
#define HREF_GPIO_NUM 23
#define PCLK_GPIO_NUM 22
bool initCamera() {
  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sscb_sda = SIOD_GPIO_NUM;
  config.pin_sscb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  config.frame_size = FRAMESIZE_QVGA;
  config.jpeg_quality = 10;
  config.fb_count = 1;
  esp_err_t result = esp_camera_init(&config);
  if (result != ESP_OK) {
    return false;
  }
  return true;
}
mtmn_config_t mtmn_config = {0};
int detections = 0;
void setup() {
  Serial.begin(115200);
  if (!initCamera()) {
    Serial.printf("Failed to initialize camera...");
    return;
  }
  mtmn_config = mtmn_init_config();
}
void loop() {
  camera_fb_t * frame;
  frame = esp_camera_fb_get();
  dl_matrix3du_t *image_matrix = dl_matrix3du_alloc(1, frame->width, frame->height, 3);
  fmt2rgb888(frame->buf, frame->len, frame->format, image_matrix->item);
  esp_camera_fb_return(frame);
  box_array_t *boxes = face_detect(image_matrix, &mtmn_config);
  if (boxes != NULL) {
    detections = detections+1;
    Serial.printf("Faces detected %d times \n", detections);
    dl_lib_free(boxes->score);
    dl_lib_free(boxes->box);
    dl_lib_free(boxes->landmark);
    dl_lib_free(boxes);
  }
  dl_matrix3du_free(image_matrix);
}
Testing the code
To test the code, compile it and upload it to your ESP32, making sure it is correctly connected to the camera. Once the procedure finishes, open the Arduino IDE serial monitor.
After that, point the camera at your face. You should see something similar to figure 1. As can be seen, a face is being detected in the captured images. Note that, in my case, I had the camera pointed at myself for a while, which is why the image shows such a high number of detections.

I’ve tested the program only with one face in front of the camera and with good lighting conditions. Naturally, your results may vary depending on the setup you have and on the lighting conditions.
Also, make sure that you are pointing the camera correctly at your face. Since there is no visual feedback of the image being captured, the first time I tried I was not getting any results because I had the camera upside down.
References
[1] https://github.com/espressif/esp-who
dl_lib_free(boxes->score);
dl_lib_free(boxes->box);
dl_lib_free(boxes->landmark);
dl_lib_free(boxes);
gives the error symbol not found (dl_lib_free). I commented them out, but I suspect it will leak memory without freeing the resources.
Hi,
Could you tell me the maximum number of faces the ESP32-CAM can detect in a single frame?
From Arduino ESP32 example…
free(net_boxes->score);
free(net_boxes->box);
free(net_boxes->landmark);
free(net_boxes);
I’m seeing the same issue…gives error symbol not found(dl_lib_free).
Your documentation is very good, well explained and simple. Thank you very much.
Can we detect multiple faces with the ESP32?
Thank you for the useful documentation.
I got the same error regarding the free function.
After detecting the face, I want to create Arduino code to get data from the ESP32 about the detected face and then apply a few stories on an LCD and others.
Can you help?
Hi, thanks for the nice shown.
My compiling got error of: ‘dl_lib_free’ was not declared in this scope
Does this sketch have an ESP32 version problem?
Best
hi, I can’t compile the code, it shows fd_forward.h not found, I am running it in Arduino IDE 2.1.0 but I also tried in 1.8.19, and it gives the same error.
How do I get it running?