Döring, Andreas: Multi-Person Pose Tracking in Videos. - Bonn, 2025. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-81032
@phdthesis{handle:20.500.11811/12916,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-81032},
author = {Döring, Andreas},
title = {Multi-Person Pose Tracking in Videos},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2025,
month = mar,
note = {This thesis addresses the complex challenges of multi-person pose estimation and tracking in dynamic, real-world environments. Motivated by advancements in artificial intelligence and the growing need for natural human-machine interaction, this research focuses on developing robust computer vision systems that can accurately estimate and track human poses. These systems have broad applications, ranging from healthcare and surveillance to autonomous driving, sports analysis, and entertainment. However, real-world scenarios, especially those involving multiple interacting individuals, frequently result in keypoint inaccuracies due to occlusions and overlapping body parts, as seen in sports where highly articulated poses are common. Effective pose tracking requires robust occlusion handling and the ability to preserve identities, even across videos.
Previous approaches formulated multi-person pose tracking as spatiotemporal graph partitioning problems, which are computationally expensive and unsuitable for real-time applications. More recent methods using flow-, motion-, or appearance-based assignments struggle when individuals are occluded, and appearance-based methods face challenges with partial occlusions and visually similar individuals. Additionally, existing datasets for multi-person pose tracking are sparsely annotated and lack identity information.
This research tackles several key challenges, including occlusions, motion blur, crowded scenes, and varying lighting conditions, which heavily impact pose estimation and tracking accuracy. We propose a top-down multi-person pose estimation and tracking framework, with the core contribution being a self-supervised keypoint correspondence network that recovers missed detections, associates keypoints across video frames, and restores tracks broken by occlusions. Moreover, recognizing the limitations of existing datasets, we introduce PoseTrack21, a densely annotated dataset designed to improve training and evaluation in multi-person pose tracking, multi-object tracking, and person search. PoseTrack21 includes comprehensive annotations for keypoints, bounding boxes, and person identities, enabling joint evaluation across these tasks. Additionally, we present a Gated Attention Transformer model that integrates appearance- and keypoint-based similarities through a novel gating mechanism, improving the robustness of pose tracking systems. This model includes a matching layer to eliminate redundant detections and a pose-conditioned re-identification network to handle occlusions, enhancing tracking consistency.
Our proposed methods are extensively evaluated on the PoseTrack 2017, PoseTrack 2018, and PoseTrack21 datasets, demonstrating their effectiveness. The contributions of this thesis establish a foundation for more efficient multi-person pose tracking systems, advancing the state of the art in this crucial area of computer vision.},
url = {https://hdl.handle.net/20.500.11811/12916}
}