Yang, Linlin: 3D Hand Pose Estimation from Single RGB Images with Auxiliary Information. - Bonn, 2022. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-68952
@phdthesis{handle:20.500.11811/10521,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-68952},
author = {Yang, Linlin},
title = {3D Hand Pose Estimation from Single RGB Images with Auxiliary Information},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2022,
month = dec,

note = {3D hand pose estimation from monocular RGB inputs is critical for augmented and virtual reality applications and has made remarkable progress thanks to the deep learning revolution. Existing deep-learning-based hand pose estimation systems aim to learn good representations of hand poses, which requires a large amount of accurate ground-truth labels that are difficult to obtain. We instead explore auxiliary information to aid representation learning and reduce the reliance on data annotation. Specifically, this dissertation investigates three kinds of auxiliary information for 3D hand pose estimation: image factors, multi-modal data, and synthetic data.
Motivated by image rendering, which requires a number of image factors of variation, we propose to learn disentangled representations to better analyze these factors. The disentangled representations enable explicit control over individual factors of variation for synthesizing hand images, and allow training with hand factors as weak labels for hand pose estimation. Beyond labelled or shared hand factors, different modalities (e.g., RGB images and depth maps) of the same hand share common information. Therefore, we present multi-modal data as auxiliary information for RGB inputs. Specifically, we explore multi-modal alignment in three aspects: latent-space alignment based on a variational autoencoder and a product of Gaussian experts, pixel-level alignment via attention fusion, and low-dimensional subspace alignment via contrastive learning. Besides alignment, the auxiliary modalities can also serve as weak labels for hand pose estimation.
To further remove the requirement for image factors or additional modalities, we emphasize the importance of synthetic data. Synthetic data is flexible, virtually unlimited, and easy to obtain. With synthetic data as auxiliary information, we can significantly reduce the amount of labelled real-world data needed. Therefore, we introduce a challenging scenario in which the model learns only from labelled synthetic data and fully unlabelled real-world data. To address this scenario, we present a semi-supervised framework with pseudo-labelling and consistency training, and tackle noisy pseudo-labels using modules such as label correction and self-distillation.
This dissertation advances the state of the art in 3D hand pose estimation, explores representation learning as well as weakly- and semi-supervised learning for pose estimation, and paves a path forward for learning pose estimation with diverse auxiliary information.},

url = {https://hdl.handle.net/20.500.11811/10521}
}

License associated with this item: InCopyright