Show simple item record

Efficient Methods for Learning Visual Multi-object 6D Pose Estimation and Tracking

dc.contributor.advisor: Behnke, Sven
dc.contributor.author: Periyasamy, Arul Selvam
dc.date.accessioned: 2026-02-06T09:49:50Z
dc.date.available: 2026-02-06T09:49:50Z
dc.date.issued: 06.02.2026
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13878
dc.description.abstract: Object detection and 6D object pose estimation are foundational components of a visual scene understanding system. Despite the intertwined nature of these tasks, standard methods for scene understanding decouple object detection and pose estimation: they perform object detection in a first stage and then process only the crops containing the target object to estimate pose parameters. In this thesis, we present an alternative approach called multi-object pose estimation, in which we perform joint object detection and pose estimation in a single step for all the objects in the scene. We formulate multi-object pose estimation as a set prediction problem and utilize the permutation-invariant nature of the recent Transformer architecture to generate a set of object predictions for a given single-view RGB image. Our model achieves accuracy comparable to state-of-the-art models while being significantly faster. Video sequences contain rich temporal information that offers additional context beyond single-view images. To take advantage of this temporal information, we develop an enhanced version of our multi-object pose estimation model by incorporating temporal fusion modules and demonstrate improved accuracy in both pose estimation and object detection. Datasets are crucial for learning-based perception methods, yet the most commonly used datasets for object-centric scene understanding feature static scenes. To enable dynamic scene understanding, we introduce a photo- and physically-realistic dataset featuring simulations of commonly occurring bin-picking scenarios, and we use this dataset to evaluate the temporal fusion approach presented in this thesis. Moreover, the ability to refine less accurate pose predictions is an important attribute of a robust scene understanding system. We introduce pose and shape parameter refinement pipelines based on iterative render-and-compare optimization.
However, comparing rendered and observed images in the RGB color space is error-prone. We therefore propose image comparison in a learned feature space that is invariant to secondary lighting effects. To facilitate time-efficient iterative refinement, we develop a lightweight differentiable renderer. Furthermore, real-world objects often exhibit symmetry. Standard pose estimation models are designed to estimate a single plausible pose among a set of symmetric poses and are thus not suitable for inferring symmetry. To this end, we model object symmetries using implicit probability density networks and present an automatic ground-truth annotation scheme that trains such implicit networks without manual symmetry label annotations.
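The abstract formulates multi-object pose estimation as a set prediction problem, in which predictions are compared to ground-truth objects as unordered sets. A minimal sketch of that order-free matching idea (not the thesis's actual implementation; the 3-vector "poses", the L1 cost, and the brute-force search over permutations are illustrative assumptions — practical systems use the Hungarian algorithm for this step):

```python
# Toy illustration of set-based matching for set prediction:
# the loss depends on the best assignment between predicted and
# ground-truth objects, so prediction order does not matter.
from itertools import permutations

def pairwise_cost(pred, gt):
    # Hypothetical cost: L1 distance between 3-vector pose summaries.
    return sum(abs(p - g) for p, g in zip(pred, gt))

def best_matching(preds, gts):
    """Return (total_cost, assignment) minimizing the summed pairwise cost.
    assignment[i] is the index of the ground-truth object matched to preds[i].
    Brute force over permutations; fine only for small sets."""
    n = len(preds)
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(pairwise_cost(preds[i], gts[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return best_cost, best_assign

# Made-up example: two predictions, two ground-truth objects.
preds = [(0.1, 0.0, 0.9), (1.0, 1.1, 0.0)]
gts   = [(1.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
cost, assign = best_matching(preds, gts)
# Matches preds[0] to gts[1] and preds[1] to gts[0], i.e. assign == (1, 0).
```

Because the minimum over assignments is taken, permuting `preds` leaves the resulting loss unchanged, which is the property the permutation-invariant Transformer decoder exploits.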
dc.language.iso: eng
dc.rights: Namensnennung 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Roboterwahrnehmung
dc.subject: Objektposenschätzung
dc.subject: Objekterkennung
dc.subject: semantische Segmentierung
dc.subject: Robotic Perception
dc.subject: Object Pose Estimation
dc.subject: Object Detection
dc.subject: Semantic Segmentation
dc.subject.ddc: 004 Informatik
dc.title: Efficient Methods for Learning Visual Multi-object 6D Pose Estimation and Tracking
dc.type: Dissertation oder Habilitation
dc.identifier.doi: https://doi.org/10.48565/bonndoc-780
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-87681
dc.relation.doi: https://doi.org/10.1109/Humanoids43949.2019.9035024
dc.relation.doi: https://doi.org/10.1109/CASE49439.2021.9551599
dc.relation.doi: https://doi.org/10.1007/978-3-030-92659-5_34
dc.relation.doi: https://doi.org/10.5220/0010817100003124
dc.relation.doi: https://doi.org/10.1109/IRC55401.2022.00044
dc.relation.doi: https://doi.org/10.1016/j.robot.2023.104490
dc.relation.doi: https://doi.org/10.1109/IRC59093.2023.00047
dc.relation.doi: https://doi.org/10.1109/ICRA57147.2024.10610674
dc.relation.doi: https://doi.org/10.48550/arXiv.2205.02536
ulbbn.pubtype: Erstveröffentlichung
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 8768
ulbbnediss.date.accepted: 22.01.2026
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Stückler, Jörg-Dieter
ulbbnediss.contributor.orcid: https://orcid.org/0000-0002-9320-3928


