Python: Hand landmark estimation with MediaPipe

Introduction

In this tutorial we are going to learn how to obtain hand landmarks from an image, using Python, MediaPipe and OpenCV. We will use OpenCV to read and display the image, and MediaPipe to perform the hand detection and landmark estimation.

In short, MediaPipe is a free and open-source framework that offers cross-platform, customizable Machine Learning solutions for live and streaming media [1]. The MediaPipe documentation lists the Machine Learning solutions available and the platforms each one supports [1].

In the case of Python, MediaPipe is available as a prebuilt Python package [1]. To install the MediaPipe package using pip, a Python package manager, simply use the following command:

pip install mediapipe

Note, however, that not all the solutions from MediaPipe are available in the Python package. Hand landmark estimation is one of the available features and it is the one we are going to try in this post.

The MediaPipe documentation gives a detailed explanation of how the hand landmark estimation process works [2]. The code shown below is based on the samples from that documentation. One of the biggest advantages of the Python API for hand landmark estimation is that it is really simple to use and hides from us the complexity of the Machine Learning models under the hood.

The model will return 21 landmarks per detected hand, where each landmark is composed of x, y and z coordinates [2]. The x and y coordinates are normalized between 0 and 1 by the image width and height, respectively. The z coordinate represents the landmark depth, where the origin is the depth at the wrist [2]. Also, smaller z values mean that the landmark is closer to the camera [2].
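To make the meaning of the z coordinate more concrete, the short sketch below orders a few landmarks by depth. The landmark names match the ones used by MediaPipe, but the z values are made up for illustration and do not come from a real detection.

```python
# Hypothetical z values for a few landmarks of one hand (not real model output).
# z is relative to the wrist, so the wrist itself sits at 0.
sample_depths = {
    "WRIST": 0.0,
    "THUMB_TIP": -0.08,
    "INDEX_FINGER_TIP": -0.12,
    "PINKY_TIP": -0.03,
}

# Smaller z means closer to the camera, so an ascending sort
# lists the landmarks from nearest to farthest.
nearest_first = sorted(sample_depths, key=sample_depths.get)

print(nearest_first)
```

With these sample values, the tip of the index finger is the point closest to the camera and the wrist is the farthest one.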

This tutorial was tested on Windows 8.1, with version 4.1.2 of OpenCV and version 0.8.3.1 of MediaPipe. The Python version used was 3.7.2. Note that MediaPipe is still in alpha, meaning that breaking changes in the API can still occur until a stable version is released [1].

Detecting hand landmarks

We will start our code by importing the cv2 module, which will allow us to read an image from the file system and display it, alongside the hand detection results, in a window. We will also import the mediapipe module, which will expose to us the functionality we need to do the estimation of the hands landmarks.

import cv2
import mediapipe

Next, we will access two submodules from mediapipe, namely drawing_utils and hands. The drawing_utils module includes some useful helper functions to draw detections and landmarks over images, amongst other functionalities. The hands module contains the Hands class that we will use to detect hand landmarks on an image. We keep references to both as a convenience, to avoid using the full path every time we want to access one of the functionalities of these modules.

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

After this we are going to create an object of class Hands, so we can process the image. As input of the constructor of this class, we are going to pass the static_image_mode parameter as True, which indicates that the images it processes should be treated as unrelated, meaning that the hand detection should run on every input image [3]. If this parameter were set to False, the images would be treated as a video stream: after a successful detection of hands, the solution localizes the hand landmarks and, in subsequent images, simply tracks those landmarks without invoking another detection, until it loses track of any of the hands [3].
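The difference between the two modes can be pictured with a toy model that counts how many times the detector would run over a sequence of frames. This is only a simplified sketch of the behavior described above, not MediaPipe's actual implementation: the count_detector_runs function and its kept_track input are hypothetical names introduced for illustration.

```python
def count_detector_runs(kept_track, static_image_mode):
    """Count detector invocations over a sequence of frames (toy model).

    kept_track[i] says whether the hands are still tracked after frame i.
    """
    runs = 0
    tracking = False
    for still_tracked in kept_track:
        # In static mode every frame is unrelated, so detection always runs.
        # In video mode detection only runs while nothing is being tracked.
        if static_image_mode or not tracking:
            runs += 1
        tracking = still_tracked
    return runs

# Five frames; the hands are lost after the third frame.
frames = [True, True, False, True, True]

print(count_detector_runs(frames, static_image_mode=True))
print(count_detector_runs(frames, static_image_mode=False))
```

In this toy model the static mode runs the detector on all five frames, while the video mode only runs it on the first frame and again after the track is lost.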

Note that the constructor supports some additional parameters that we will leave with the defaults. These parameters are the following [2]:

  • max_num_hands: Maximum number of hands to detect. Defaults to 2.
  • min_detection_confidence: Minimum confidence value (between 0 and 1) for the hand detection to be considered successful. Defaults to 0.5.
  • min_tracking_confidence: Minimum confidence value (between 0 and 1) for the hand landmarks to be considered tracked successfully. Defaults to 0.5.

We are going to wrap the creation of this object in a with statement, to ensure the resources are freed after the object is used. You can check the implementation of the __enter__ and __exit__ methods in the parent class of the Hands class, which is called SolutionBase.

with handsModule.Hands(static_image_mode=True) as hands:
    # Hands key points detection

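Inside the with block we will read the image by calling the imread function from the cv2 module, passing as input a string with the path to the image. This will return the image as a NumPy ndarray. Note that imread returns None, instead of raising an error, when the file cannot be read, so double-check the path if the following steps fail.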

    image = cv2.imread("C:/Users/N/Desktop/hand.jpg")

After that we are going to take care of the hand landmark detection. We do this with a call to the process method on our Hands object. This method receives as input an ndarray with an image in RGB and returns as output a NamedTuple object containing a collection of hand landmarks for the hands found in the image and a collection with the handedness of the detected hands (whether each hand is a left or a right hand) [4]. For this tutorial we won’t be analyzing the handedness.

It is important to take into consideration that the process method expects an RGB image but, when reading images with OpenCV, we obtain them in BGR format. As such, we will first convert our original image to RGB, with a call to the cvtColor function from the cv2 module, and pass the result to the process method. Note that we need the RGB version only for the landmark estimation: we are going to display the result in an OpenCV window, which expects BGR, so we keep the original image for drawing and display.

    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
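The BGR to RGB conversion only reorders the three color channels of each pixel. The sketch below shows the idea on a tiny image stored as nested Python lists; the bgr_to_rgb helper is a hypothetical name used for illustration, and in the tutorial code cvtColor does this work on the NumPy array.

```python
def bgr_to_rgb(image):
    """Reverse the channel order of every pixel in a nested-list image."""
    return [[pixel[::-1] for pixel in row] for row in image]

# A 1x2 "image": a pure blue pixel and a pure red pixel, in BGR order.
bgr_image = [[(255, 0, 0), (0, 0, 255)]]
rgb_image = bgr_to_rgb(bgr_image)

print(rgb_image)
```

The blue pixel, stored as (255, 0, 0) in BGR, becomes (0, 0, 255) in RGB. Skipping the conversion would swap the red and blue channels of the image the model sees.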

The landmarks estimated for each detected hand can be found in the multi_hand_landmarks field of the NamedTuple we just obtained. We will iterate through each element and draw the detected points on the image. Note that each element contains the landmarks for one detected hand (so, you can think of multi_hand_landmarks as an array of arrays).

We then draw the landmarks with a call to the draw_landmarks function from the drawing_utils module. As first input we pass the image where we want to draw the landmarks. As second input we pass the list of hand landmarks of the hand.

As third input of the draw_landmarks function we will pass an optional parameter that corresponds to a list indicating how the landmarks connect to each other, which will allow us to draw these connections as lines. We will use the HAND_CONNECTIONS Python frozenset exposed by the hands module, which already contains all of those connections. If we don’t pass this list, then only the landmark points will be drawn on the image.

Note that this function call will mutate the original image we have loaded with OpenCV. If some application needs to preserve the original, we can simply make a copy of it first, as covered in a previous tutorial.

    for handLandmarks in results.multi_hand_landmarks:
        drawingModule.draw_landmarks(image, handLandmarks, handsModule.HAND_CONNECTIONS)
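To get a feel for what a set of connections looks like, the sketch below turns pairs of landmark indexes into line segments. It uses an illustrative subset that joins the wrist to the tip of the thumb, together with made-up pixel coordinates; the connections_to_segments helper is a hypothetical name, not part of MediaPipe.

```python
# Illustrative subset of connections: wrist -> thumb base -> ... -> thumb tip.
# Each pair holds the indexes of two landmarks joined by a line.
thumb_connections = frozenset([(0, 1), (1, 2), (2, 3), (3, 4)])

# Hypothetical pixel coordinates for landmarks 0 to 4 (not real model output).
pixel_points = [(100, 200), (120, 180), (135, 160), (145, 140), (150, 120)]

def connections_to_segments(connections, points):
    """Turn index pairs into ((x1, y1), (x2, y2)) line segments."""
    # A frozenset has no order, so sort the pairs for a stable result.
    return [(points[start], points[end]) for start, end in sorted(connections)]

segments = connections_to_segments(thumb_connections, pixel_points)
for segment in segments:
    print(segment)
```

Each resulting segment is a pair of points that a drawing routine could join with a line.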

Now that we have drawn the landmarks on the image, to finalize the code we are going to display it in an OpenCV window.

    cv2.imshow('Test image', image)

    cv2.waitKey(0)
    cv2.destroyAllWindows()

The complete code can be seen below. Note that we have added a check to ensure that detections were found, before trying to iterate through them.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

with handsModule.Hands(static_image_mode=True) as hands:

    image = cv2.imread("C:/Users/N/Desktop/hand.jpg")

    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.multi_hand_landmarks is not None:
        for handLandmarks in results.multi_hand_landmarks:
            drawingModule.draw_landmarks(image, handLandmarks, handsModule.HAND_CONNECTIONS)

    cv2.imshow('Test image', image)

    cv2.waitKey(0)
    cv2.destroyAllWindows()
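The check for missing detections can also be folded into a small helper, so that the loop body never has to deal with None. The sketch below uses stand-in objects that only mimic the multi_hand_landmarks field; the iter_hands name is hypothetical and the stand-ins are not MediaPipe types.

```python
from types import SimpleNamespace

def iter_hands(results):
    """Yield each hand's landmarks, or nothing when no hand was detected."""
    # `or []` turns a None field into an empty list so the loop is always safe.
    yield from results.multi_hand_landmarks or []

# Stand-in objects that only mimic the field used above (not MediaPipe types).
no_hands = SimpleNamespace(multi_hand_landmarks=None)
two_hands = SimpleNamespace(multi_hand_landmarks=["first hand", "second hand"])

print(list(iter_hands(no_hands)))
print(list(iter_hands(two_hands)))
```

With a helper like this, the drawing loop could iterate over iter_hands(results) directly, although the explicit if check used in the complete code is just as valid.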

To test the code, simply run it using a tool of your choice. I’ll be using PyCharm, a Python IDE. Don’t forget to change the image path in the code above to point to an image containing a hand, stored on your computer.

You should get a result similar to figure 1. As can be seen, the 21 landmarks were drawn with lines connecting them, as expected. Naturally, the correctness of the results may vary depending on the images used and we should not expect 100% accuracy.

Figure 1 – Hand landmarks detected on image.

Analyzing the hand landmarks

In the previous section we did not analyze the actual content of each landmark, as we focused on the representation of the landmarks in an image. In this section, we will take a closer look at the results we have obtained. The detection code will be the same, but now we will do some analysis on the landmarks. In particular, we are going to identify each hand point of the model and print the normalized and pixel coordinates (only x and y in the pixel coordinates case) of each landmark.

We are not going to analyze the z coordinate in detail as, at the time of writing, the documentation is a bit vague, indicating that “the magnitude of z uses roughly the same scale as x” [4]. Nonetheless, the normalized value can be interesting when analyzing the hand position, as it gives us an idea of how each point is positioned in relation to the others, in terms of depth.

Looking into the actual code, like before, we start with the module imports and the creation of the Hands object, followed by reading the image and calling the process method to get the landmarks.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

with handsModule.Hands(static_image_mode=True) as hands:
    image = cv2.imread("C:/Users/N/Desktop/hand.jpg")
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    
    # Analysis of the landmarks

After this we will obtain the dimensions of the image we have read, so we can convert from normalized x and y coordinates to pixel coordinates. For a color image, the shape attribute of the NumPy array holds the height, the width and the number of channels, in that order.

imageHeight, imageWidth, _ = image.shape

Like before, we are going to iterate through each detected hand, to leave our code ready for a multi-hand scenario.

for handLandmarks in results.multi_hand_landmarks:
    # process each hand list of landmarks

Since each hand is composed of a set of well-known points (e.g. the wrist, the tip of the thumb, etc.), the hands module has an enumeration, called HandLandmark, containing the indexes of the 21 hand landmarks. A visual representation of these landmarks as points is available in the documentation [2].

In the list of landmarks for a given hand, these indexes are always respected, so we know where to look to find a specific part of the hand. For example, if we have images of thumbs up and thumbs down, we may want to compare the position of the tip of the thumb with the position of the wrist to tell the two gestures apart. This is why it is important to know what each landmark refers to.
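The thumbs up or down idea can be sketched in a few lines. The example assumes that the normalized y coordinate follows the usual image convention, growing from the top of the image to the bottom, and it uses made-up y values instead of real detections. The thumb_direction function is a hypothetical helper and a deliberately naive classifier.

```python
WRIST = 0       # index of the wrist landmark
THUMB_TIP = 4   # index of the thumb tip landmark

def thumb_direction(normalized_ys):
    """Classify a hand as 'up' or 'down' from per-landmark normalized y values."""
    # Image y grows downwards, so a smaller y means higher in the picture.
    if normalized_ys[THUMB_TIP] < normalized_ys[WRIST]:
        return "up"
    return "down"

# Hypothetical y values for landmarks 0 to 4 (not real model output).
thumbs_up_ys = [0.80, 0.70, 0.55, 0.40, 0.25]
thumbs_down_ys = [0.20, 0.30, 0.45, 0.60, 0.75]

print(thumb_direction(thumbs_up_ys))
print(thumb_direction(thumbs_down_ys))
```

A real application would likely need to look at more landmarks, since this comparison alone cannot tell a thumbs up from an open hand pointing upwards.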

So, we are going to iterate through the members of the HandLandmark enumeration, which will allow us to identify the name of each point before printing its values.

for point in handsModule.HandLandmark:
   # process each landmark

To access the actual list of landmarks of the hand by index, we cannot directly use the handLandmarks variable. We need to access its landmark field instead.

normalizedLandmark = handLandmarks.landmark[point]
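Indexing with an enumeration member works because the member behaves like an integer. The sketch below reproduces the pattern with a small stand-in enumeration, assuming HandLandmark behaves like a Python IntEnum. The HandPoint class only lists the first five landmark names and the landmarks list holds placeholder strings, so neither is a MediaPipe type.

```python
from enum import IntEnum

class HandPoint(IntEnum):
    """Hypothetical stand-in listing only the first five landmark indexes."""
    WRIST = 0
    THUMB_CMC = 1
    THUMB_MCP = 2
    THUMB_IP = 3
    THUMB_TIP = 4

# Stand-in for a hand's landmark sequence (placeholder values).
landmarks = ["wrist data", "cmc data", "mcp data", "ip data", "tip data"]

for point in HandPoint:
    # An IntEnum member is also an int, so it can be used directly as an index.
    print(point.name, point.value, landmarks[point])
```

Using point.name gives the readable name of the landmark regardless of how the Python version in use formats the enumeration member itself when printed.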

Now that we have access to the normalized landmark (by default, landmarks are returned in their normalized format), we are going to convert it to pixel coordinates with a call to the _normalized_to_pixel_coordinates function from the drawing_utils module. Note that the leading underscore marks this function as private by Python convention, so it may change between releases. It receives the following inputs:

  • Normalized x coordinate;
  • Normalized y coordinate;
  • Image width;
  • Image height.

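This function returns as output a tuple with the x and y coordinates of the landmark, in pixels. Note that it may return None instead when a normalized coordinate falls outside the 0 to 1 range, for example when part of the hand lies outside the image.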

pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, imageWidth, imageHeight)
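Since that helper is private, it can be useful to know what the conversion amounts to. The sketch below is a plain scale-and-clamp version written for illustration, under the assumption that out-of-range inputs should yield None. The to_pixel_coordinates name is made up, and MediaPipe's own function may differ in its details.

```python
import math

def to_pixel_coordinates(normalized_x, normalized_y, image_width, image_height):
    """Scale normalized coordinates to pixel coordinates, or return None."""
    # Reject values outside the normalized range instead of guessing.
    if not (0.0 <= normalized_x <= 1.0 and 0.0 <= normalized_y <= 1.0):
        return None
    # Clamp to the last valid index so that 1.0 still maps inside the image.
    x_px = min(math.floor(normalized_x * image_width), image_width - 1)
    y_px = min(math.floor(normalized_y * image_height), image_height - 1)
    return x_px, y_px

print(to_pixel_coordinates(0.5, 0.25, 640, 480))
print(to_pixel_coordinates(1.0, 1.0, 640, 480))
print(to_pixel_coordinates(1.2, 0.5, 640, 480))
```

Note the order of the last two arguments: the width comes before the height, which is the opposite of the order in which image.shape returns them.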

To finalize, we are going to print the name, the pixel coordinates and the normalized coordinates of the landmark.

print(point)
print(pixelCoordinatesLandmark)
print(normalizedLandmark)

The complete code is shown below.

import cv2
import mediapipe

drawingModule = mediapipe.solutions.drawing_utils
handsModule = mediapipe.solutions.hands

with handsModule.Hands(static_image_mode=True) as hands:

    image = cv2.imread("C:/Users/N/Desktop/hand.jpg")
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    imageHeight, imageWidth, _ = image.shape

    if results.multi_hand_landmarks is not None:
        for handLandmarks in results.multi_hand_landmarks:
            for point in handsModule.HandLandmark:

                normalizedLandmark = handLandmarks.landmark[point]
                pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, imageWidth, imageHeight)

                print(point)
                print(pixelCoordinatesLandmark)
                print(normalizedLandmark)

Upon running the code, you should get a result similar to figure 2. As can be seen, the information of each landmark is printed.

Figure 2 – Hand landmarks details.

References

[1] https://google.github.io/mediapipe/

[2] https://google.github.io/mediapipe/solutions/hands.html

[3] https://google.github.io/mediapipe/solutions/hands#static_image_mode

[4] https://google.github.io/mediapipe/solutions/hands#output
