Python MediaPipe: real-time hand tracking and landmarks estimation

Introduction

In this tutorial we will learn how to perform real-time hand tracking and landmarks estimation using Python, OpenCV and MediaPipe. We will read the video from a webcam using OpenCV and perform the hand tracking and landmarks estimation using MediaPipe's Hands solution.

The code we are going to cover here is a continuation of the tutorial where we learned how to perform detection and landmarks estimation of hands on a static image (link here).

This tutorial was tested on Windows 8.1, with version 4.1.2 of OpenCV and version 0.8.3.1 of MediaPipe (alpha version). The Python version used was 3.7.2.

Real-time hand tracking and landmark estimation

In this first section, we are going to cover the basics of grabbing video from a camera and performing the hand tracking and landmarks estimation.

We will start our code by importing the cv2 and the mediapipe modules. The cv2 module will allow us to obtain the frames from the camera, over which we will perform the hand landmarks estimation by using the functionality exposed by the mediapipe module.

For convenience, we will also access the drawing_utils and hands modules from mediapipe, so we don’t need to use the complete path every time we want to use the functionality they expose.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

After this, to access the video from the camera, we will create an object of class VideoCapture. The constructor of this class receives as input the index of the camera we want to access. If we have a single camera, we can pass the value 0.

capture = cv2.VideoCapture(0)
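If the index does not correspond to an available camera, the VideoCapture object is still created but cannot deliver frames. A minimal sketch of a check based on the isOpened method is shown below. The helper name ensure_opened is made up for this example and is not part of OpenCV.

```python
def ensure_opened(capture, index=0):
    """Raise a RuntimeError if the capture device could not be opened.

    capture is expected to behave like cv2.VideoCapture, that is, to expose
    an isOpened() method. ensure_opened is a hypothetical helper name.
    """
    if not capture.isOpened():
        raise RuntimeError("Could not open camera with index {}".format(index))
    return capture
```

With the code from this tutorial, it could be used as capture = ensure_opened(cv2.VideoCapture(0)).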

Next we will instantiate an object of class Hands, which we will use to perform the hand tracking and landmarks estimation. We will make use of the following optional parameters of the constructor:

  • static_image_mode: Indicates whether the input images should be treated as independent and unrelated (True) or as a video stream (False). We are going to set the value to False, which means that, after a successful detection of hands in a video frame, the algorithm will localize the landmarks and, in subsequent frames, simply track them without invoking another detection, until it loses track of any of the hands.
  • max_num_hands: Maximum number of hands to be detected. Although this parameter defaults to 2, we will explicitly set this field to the same value, to illustrate its usage.
  • min_detection_confidence: Minimum confidence value (between 0 and 1) for the hand detection to be considered successful. We will set it to 0.7.
  • min_tracking_confidence: Minimum confidence value (between 0 and 1) for the hand landmarks to be considered tracked successfully. We will set it to 0.7.
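As a side note, the four parameters above can be collected in a dictionary before being passed to the constructor, which makes it easy to validate them in one place. The sketch below is only an illustration: build_hands_options is a hypothetical helper, and the range check is our own addition rather than something the tutorial relies on.

```python
def build_hands_options(max_hands=2, detection=0.7, tracking=0.7):
    """Collect the Hands constructor arguments described above in a dict.

    build_hands_options is a hypothetical helper, not a MediaPipe function.
    """
    for name, value in (("detection", detection), ("tracking", tracking)):
        if not 0.0 <= value <= 1.0:
            raise ValueError("{} confidence must be between 0 and 1".format(name))
    return {
        "static_image_mode": False,  # treat the input as a video stream
        "max_num_hands": max_hands,
        "min_detection_confidence": detection,
        "min_tracking_confidence": tracking,
    }
```

The resulting dictionary could then be unpacked into the constructor call, as in handsModule.Hands(**build_hands_options()).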

We will wrap the creation of the object in a with statement, to guarantee the resources are freed once we no longer need it.

with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.7, min_tracking_confidence=0.7, max_num_hands=2) as hands:

Now we will take care of getting frames from the camera. We will do it inside a loop that will break if the user presses the ESC key.

    while (True):

        # Hand landmarks estimation on each frame

        if cv2.waitKey(1) == 27:
            break
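On some platforms, the value returned by cv2.waitKey may have extra high bits set, so a common defensive idiom is to keep only the lowest byte before comparing it with the key code. The sketch below illustrates that check. The helper name is_escape is an assumption made for this example.

```python
ESC_KEY = 27  # ASCII code of the ESC key, as used in the tutorial


def is_escape(key_code):
    """Return True if a value returned by cv2.waitKey corresponds to ESC.

    cv2.waitKey returns -1 when no key was pressed during the wait.
    is_escape is a hypothetical helper, not part of OpenCV.
    """
    if key_code == -1:
        return False
    # Keep only the lowest byte, which holds the ASCII code of the key.
    return (key_code & 0xFF) == ESC_KEY
```

Inside the loop, the check would then read: if is_escape(cv2.waitKey(1)): break.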

To get a frame, we call the read method on our VideoCapture object. This method receives no arguments and returns a tuple. The first tuple value is a Boolean indicating if the frame was read correctly or not (we could use it for error checking, although in this tutorial we won’t do it to keep the code short). The second tuple value is the frame, represented as a numpy ndarray.

ret, frame = capture.read()
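If you do want the error check mentioned above, a minimal sketch could look like the following. The names read_frame and FakeCapture are made up for illustration: FakeCapture is only a stand-in that mimics the read method of VideoCapture, so the snippet can be tried without a camera.

```python
def read_frame(capture):
    """Return the next frame, or None if the camera delivered nothing."""
    ret, frame = capture.read()
    if not ret or frame is None:
        return None
    return frame


class FakeCapture:
    """Hypothetical stand-in for cv2.VideoCapture with a fixed frame list."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        # Mimic VideoCapture.read: (True, frame) on success, (False, None) otherwise.
        if self._frames:
            return True, self._frames.pop(0)
        return False, None
```

In the real loop, we could break out as soon as read_frame(capture) returns None, instead of passing an invalid frame to the next steps.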

Now, to perform the hand landmarks estimation, we simply need to call the process method on our Hands object. One small detail that we need to consider is that this method expects an image in RGB format, but the OpenCV frames obtained in the previous call are returned in BGR format. As such, we will convert the color space from BGR to RGB and pass the resulting image to the process method.

results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

If any hand was detected, we will then iterate through each detection and use the draw_landmarks function from the drawing_utils module to draw the landmarks in the image. Note that we will draw the landmarks over the BGR image we originally obtained because we are going to display it in an OpenCV window, which expects this format.

if results.multi_hand_landmarks is not None:
    for handLandmarks in results.multi_hand_landmarks:
        drawingModule.draw_landmarks(frame, handLandmarks, handsModule.HAND_CONNECTIONS)
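Since multi_hand_landmarks is None when no hand is found, the check and the loop can also be folded into a small helper that always returns a list. This is just a sketch, and detected_hands is a hypothetical name rather than a MediaPipe function.

```python
def detected_hands(results):
    """Return the detected hands as a list, empty if nothing was found.

    results is expected to expose a multi_hand_landmarks attribute that is
    either None or a sequence with one entry per detected hand.
    """
    return list(results.multi_hand_landmarks or [])
```

With it, the drawing loop could be written as for handLandmarks in detected_hands(results), without a separate check.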

Then, we will draw the image in a window.

cv2.imshow('Test hand', frame)

The whole loop for reading and processing the frames can be seen below.

with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.7, min_tracking_confidence=0.7, max_num_hands=2) as hands:

    while (True):

        ret, frame = capture.read()
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks is not None:
            for handLandmarks in results.multi_hand_landmarks:
                drawingModule.draw_landmarks(frame, handLandmarks, handsModule.HAND_CONNECTIONS)

        cv2.imshow('Test hand', frame)

        if cv2.waitKey(1) == 27:
            break

After the loop breaks, we are going to destroy the window and release the camera.

cv2.destroyAllWindows()
capture.release()
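Note that these two calls are skipped if an exception is raised inside the loop. One possible way to make the cleanup unconditional is a small context manager, sketched below. The name released_on_exit and its destroy_windows parameter are assumptions made for this example.

```python
from contextlib import contextmanager


@contextmanager
def released_on_exit(capture, destroy_windows=None):
    """Yield the capture and always release it afterwards.

    destroy_windows is an optional callable, such as cv2.destroyAllWindows.
    released_on_exit is a hypothetical helper, not part of OpenCV.
    """
    try:
        yield capture
    finally:
        # Runs both on normal exit and when an exception propagates.
        capture.release()
        if destroy_windows is not None:
            destroy_windows()
```

It could be used as with released_on_exit(cv2.VideoCapture(0), cv2.destroyAllWindows) as capture, with the frame loop placed inside the block.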

The complete code can be seen below.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

capture = cv2.VideoCapture(0)

with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.7, min_tracking_confidence=0.7, max_num_hands=2) as hands:

    while (True):

        ret, frame = capture.read()
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks is not None:
            for handLandmarks in results.multi_hand_landmarks:
                drawingModule.draw_landmarks(frame, handLandmarks, handsModule.HAND_CONNECTIONS)

        cv2.imshow('Test hand', frame)

        if cv2.waitKey(1) == 27:
            break

cv2.destroyAllWindows()
capture.release()

To test the code, simply run it in a tool of your choice. In my case, I’ll be using PyCharm, a Python IDE. Also, don’t forget to make sure you have a webcam connected to your computer.

You should obtain a result similar to what can be seen in figure 1 below. As expected, our program is able to identify and track a hand in a video, and also to display its 21 landmarks.

Figure 1 – Real-time hand tracking and landmarks estimation.

Accessing the landmarks

In the previous section, we simply grabbed the results of the hands detection / landmarks estimation and passed them to a MediaPipe function that did the whole job of drawing them in each frame. Nonetheless, for a real application, we will most likely need to access the landmark points and do something with them. For illustration purposes, we will check how to iterate through all the points and display them as circles, using an OpenCV drawing function instead of MediaPipe’s function.

The initial part of the code is the same as in the previous section: importing the modules needed and then creating a VideoCapture object.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

capture = cv2.VideoCapture(0)

After that, we will get the width and the height of the frames that are obtained from the camera, which we will later need to convert the landmark normalized coordinates to pixel coordinates.

# capture.get returns floats, so we convert the dimensions to integers
frameWidth = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
frameHeight = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
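As an alternative, the dimensions can be read from a frame itself, since OpenCV frames are arrays whose shape starts with the number of rows (height) followed by the number of columns (width). The helper below is a sketch, and frame_size is a hypothetical name.

```python
def frame_size(frame):
    """Return (width, height) in pixels for an image shaped (rows, cols, ...).

    frame is expected to expose a shape attribute, like a numpy ndarray.
    """
    height, width = frame.shape[:2]
    return int(width), int(height)
```

This avoids relying on the camera properties, at the cost of needing at least one frame before the size is known.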

After that, we will focus our attention on the loop where we will read the frames and estimate the hand landmarks.

with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.7, min_tracking_confidence=0.7, max_num_hands=2) as hands:

    while (True):

        ret, frame = capture.read()

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks is not None:
            for handLandmarks in results.multi_hand_landmarks:
                # draw circle for each keypoint

For each hand detected, we will iterate through the members of the HandLandmark enum of the hands module, which will allow us to identify each landmark on the list.

for point in handsModule.HandLandmark:
    # draw circle for the keypoint
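To see why iterating over the enum works as a way of walking the landmark list, the sketch below uses a small stand-in enum. LandmarkSubset and the landmarks list are made up for this example: the stand-in reproduces only three of the 21 entries of HandLandmark, and the list plays the role of handLandmarks.landmark.

```python
from enum import IntEnum


class LandmarkSubset(IntEnum):
    """Stand-in with three of the 21 entries of handsModule.HandLandmark."""
    WRIST = 0
    THUMB_TIP = 4
    INDEX_FINGER_TIP = 8


# Stand-in for handLandmarks.landmark: one entry per landmark index.
landmarks = ["landmark-{}".format(i) for i in range(21)]

# Iterating an enum class yields its members in definition order, and an
# IntEnum member can be used directly as a list index.
selected = {point.name: landmarks[point] for point in LandmarkSubset}
```

The same pattern also allows picking a single landmark by name instead of iterating through all of them.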

Then, for each point of the enum, we will obtain the corresponding landmark and convert it from normalized values to pixel coordinates. The MediaPipe drawing_utils module exposes a function called _normalized_to_pixel_coordinates that performs this conversion. As output, the function returns a tuple with the x and y pixel coordinates of the landmark, or None if the normalized coordinates fall outside the image. Note that the leading underscore marks the function as an internal helper, so it may change between MediaPipe versions.

normalizedLandmark = handLandmarks.landmark[point]
pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, frameWidth, frameHeight)
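To make the conversion more concrete, here is a minimal pure-Python sketch of the same idea. It is not MediaPipe's implementation, and the function name to_pixel is an assumption made for this example.

```python
import math


def to_pixel(normalized_x, normalized_y, image_width, image_height):
    """Map normalized [0, 1] coordinates to integer pixel coordinates.

    Returns None when either coordinate lies outside [0, 1].
    to_pixel is a hypothetical helper, not a MediaPipe function.
    """
    if not (0.0 <= normalized_x <= 1.0 and 0.0 <= normalized_y <= 1.0):
        return None
    # Clamp so that a normalized value of exactly 1.0 stays inside the image.
    x_px = min(math.floor(normalized_x * image_width), image_width - 1)
    y_px = min(math.floor(normalized_y * image_height), image_height - 1)
    return x_px, y_px
```

For a 640x480 frame, a landmark at the center (0.5, 0.5) maps to the pixel (320, 240), while a landmark slightly outside the frame yields None and should not be drawn.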

Now that we have the pixel coordinates, we will draw a circle centered on the landmark point. This is done with a call to the circle function from the cv2 module, which receives the following inputs (check a detailed tutorial on how to draw circles here):

  • The image where to draw the circle. We will pass the current frame we are processing.
  • A tuple with the x and y coordinates of the center of the circle. We will pass the tuple with the coordinates of the landmark.
  • The radius of the circle, in pixels. We will pass a value of 5.
  • A tuple with the color of the circle, in BGR (Blue, Green and Red) format. We will set the color to green.
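  • The thickness of the circle outline. We will pass the value -1, so the circle is filled.

# Skip landmarks that could not be converted (outside the frame)
if pixelCoordinatesLandmark is not None:
    cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)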

After iterating through each hand and drawing each landmark, we will display the image and then check if the user has pressed the ESC key, to break the loop. The complete code is shown below. It already includes destroying the window and releasing the camera at the end.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

capture = cv2.VideoCapture(0)

# capture.get returns floats, so we convert the dimensions to integers
frameWidth = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
frameHeight = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))

with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.7, min_tracking_confidence=0.7, max_num_hands=2) as hands:

    while (True):

        ret, frame = capture.read()

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks is not None:
            for handLandmarks in results.multi_hand_landmarks:
                for point in handsModule.HandLandmark:

                    normalizedLandmark = handLandmarks.landmark[point]
                    pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, frameWidth, frameHeight)

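                    # Skip landmarks that could not be converted (outside the frame)
                    if pixelCoordinatesLandmark is not None:
                        cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)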

        cv2.imshow('Test hand', frame)

        if cv2.waitKey(1) == 27:
            break

cv2.destroyAllWindows()
capture.release()

Like before, simply run the Python code from the previous snippet. You should get a result similar to figure 2. As can be seen, the landmarks are being drawn as small green circles, as expected.

Figure 2 – Representing the hand landmarks as small circles.
