Periyasamy, Arul Selvam: Efficient Methods for Learning Visual Multi-object 6D Pose Estimation and Tracking. - Bonn, 2026. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-87681
@phdthesis{handle:20.500.11811/13878,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-87681},
doi = {https://doi.org/10.48565/bonndoc-780},
author = {Periyasamy, Arul Selvam},
title = {Efficient Methods for Learning Visual Multi-object 6D Pose Estimation and Tracking},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2026,
month = feb,
note = {Object detection and 6D object pose estimation are foundational components of a visual scene understanding system. Despite the intertwined nature of these tasks, standard methods for scene understanding decouple them: they detect objects in a first stage and then estimate pose parameters only from the crops containing each target object. In this thesis, we present an alternative approach, multi-object pose estimation, which performs object detection and pose estimation jointly, in a single step, for all objects in the scene. We formulate multi-object pose estimation as a set prediction problem and exploit the permutation-invariant nature of the Transformer architecture to generate a set of object predictions from a given single-view RGB image. Our model achieves accuracy comparable to state-of-the-art models while being significantly faster. Video sequences contain rich temporal information that offers additional context beyond single-view images. To take advantage of this temporal information, we extend our multi-object pose estimation model with temporal fusion modules and demonstrate improved accuracy in both pose estimation and object detection. Datasets are crucial for learning-based perception methods, yet the most commonly used datasets for object-centric scene understanding feature static scenes. To enable dynamic scene understanding, we introduce a photorealistic and physically realistic dataset simulating commonly occurring bin-picking scenarios, and we use it to evaluate the temporal fusion approach presented in this thesis. Moreover, the ability to refine inaccurate pose predictions is an important attribute of robust scene understanding systems. We introduce pose and shape parameter refinement pipelines based on iterative render-and-compare optimization. Since comparing rendered and observed images in RGB color space is error-prone, we propose comparing them in a learned feature space that is invariant to secondary lighting effects. To make iterative refinement time-efficient, we develop a lightweight differentiable renderer. Furthermore, many real-world objects exhibit symmetries. Standard pose estimation models are designed to estimate a single plausible pose among a set of symmetric poses and are therefore unsuitable for inferring symmetries. To this end, we model object symmetries using implicit probability density networks and present an automatic ground-truth annotation scheme that allows training such implicit networks without manual symmetry annotations.},
url = {https://hdl.handle.net/20.500.11811/13878}
}
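
The abstract formulates multi-object pose estimation as a set prediction problem, in which the training loss must be invariant to the order of the predicted objects. A minimal Python sketch of the bipartite-matching step commonly used in such formulations is shown below; the function names and the pairwise cost are illustrative assumptions, not the thesis' actual implementation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_predictions(pred_poses, gt_poses, pose_cost):
        """Match an unordered set of predicted poses to ground-truth objects.

        Hungarian matching makes the loss permutation-invariant: each
        ground-truth object is paired with its cheapest prediction.
        `pose_cost(pred, gt)` is a hypothetical pairwise cost, e.g. a
        translation distance plus a rotation geodesic term.
        """
        # Rectangular cost matrix: one row per prediction, one column per GT object.
        cost = np.array([[pose_cost(p, g) for g in gt_poses] for p in pred_poses])
        pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal bipartite matching
        return list(zip(pred_idx, gt_idx))

The matched pairs can then be scored with a pose loss, while unmatched predictions are typically supervised toward a "no object" class.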
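The refinement pipelines described in the abstract are based on iterative render-and-compare optimization through a differentiable renderer, with images compared in a learned, lighting-invariant feature space. The following PyTorch loop is a schematic sketch under those assumptions; `render_fn` and `feat_fn` are hypothetical stand-ins for the thesis' renderer and feature extractor, not its actual API:

    import torch

    def refine_pose(render_fn, feat_fn, observed_rgb, pose_init,
                    n_iters=50, lr=1e-2):
        """Gradient-based pose refinement by render-and-compare.

        render_fn(pose) -> RGB image, differentiable w.r.t. the pose parameters
        feat_fn(image)  -> feature map assumed invariant to secondary lighting
        Both callables are illustrative placeholders.
        """
        pose = pose_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([pose], lr=lr)
        target = feat_fn(observed_rgb).detach()  # fixed comparison target
        for _ in range(n_iters):
            optimizer.zero_grad()
            rendered = render_fn(pose)  # differentiable rendering
            loss = torch.nn.functional.mse_loss(feat_fn(rendered), target)
            loss.backward()             # gradients flow back through the renderer
            optimizer.step()
        return pose.detach()

Comparing in feature space rather than raw RGB is what makes the objective robust to secondary lighting effects such as shadows and reflections.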

The following license files are associated with this item:

Attribution 4.0 International (CC BY 4.0)