Keypoint Detection

Keypoint detection locates specific points of interest on objects or bodies within an image. For human pose estimation, keypoints mark joint locations like shoulders, elbows, wrists, hips, knees, and ankles. For faces, they mark eyes, nose tip, mouth corners, and jawline. For objects, they can mark functional parts like door handles, wheel centers, or component attachment points.

Architectures for keypoint detection typically produce heatmaps, one per keypoint type, where bright spots indicate the predicted location of each point. Top-down methods first detect objects with bounding boxes, then estimate keypoints within each box (HRNet, ViTPose). Bottom-up methods detect all keypoints in the image at once, then group them into individual instances (OpenPose, HigherHRNet). Top-down approaches are generally more accurate but slower because they run the keypoint estimator once per detected object.
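The heatmap decoding step described above can be sketched in a few lines: take the argmax of each keypoint channel and keep its peak value as a confidence score. This is a minimal illustration using NumPy with synthetic Gaussian heatmaps; the function name, threshold value, and tensor layout `(K, H, W)` are assumptions for the example, not any particular library's API.

```python
import numpy as np

def decode_heatmaps(heatmaps, threshold=0.1):
    """Decode (K, H, W) heatmaps into one (x, y, score) per keypoint via argmax."""
    keypoints = []
    for hm in heatmaps:
        # Flat argmax, then convert back to 2D (row, col) coordinates
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        score = float(hm[y, x])
        # Drop keypoints whose peak falls below the confidence threshold
        keypoints.append((int(x), int(y), score) if score >= threshold else None)
    return keypoints

# Synthetic example: one Gaussian peak per keypoint channel
H, W = 64, 48
ys, xs = np.mgrid[0:H, 0:W]
centers = [(20, 30), (10, 40), (45, 5)]  # assumed (x, y) peak locations
heatmaps = np.stack([
    np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * 2.0 ** 2))
    for cx, cy in centers
])
print(decode_heatmaps(heatmaps))
```

Real systems refine this with sub-pixel offsets (e.g. a quarter-pixel shift toward the second-highest neighbor) because the argmax alone is quantized to the heatmap grid, which is typically a fraction of the input resolution.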

Applications include human pose estimation for sports analytics and fitness tracking, hand keypoint detection for gesture recognition and sign language, facial landmark detection for face alignment and expression analysis, animal pose estimation for wildlife monitoring and veterinary science, and industrial keypoint detection for measuring component positions in manufacturing assembly. MediaPipe provides lightweight real-time keypoint models for hands, faces, and full bodies on mobile devices.

Get Started Now

Get started with Datature's computer vision platform now for free.