Visual Failure Detection in Robotics

dc.contributor.advisor: Gall, Jürgen
dc.contributor.author: Thoduka, Santosh George
dc.date.accessioned: 2026-03-31T14:51:25Z
dc.date.available: 2026-03-31T14:51:25Z
dc.date.issued: 31.03.2026
dc.identifier.uri: https://hdl.handle.net/20.500.11811/14054
dc.description.abstract: Autonomous robots in human-centric environments can encounter unforeseen situations, making them prone to task execution failures. These failures can create unsafe conditions for both humans and robots and erode trust in robots. Robots should therefore be able to prevent, detect, and respond to failures. In this thesis, we focus on the detection of failures, particularly using video data from the robot's camera. Several visual failure detection datasets have emerged in recent years; however, datasets suitable for building general-purpose failure detection approaches remain scarce. Existing work has shown the benefit of multimodal data for failure detection, but determining the best data representations and fusion methods remains an open challenge. Additionally, although most approaches develop failure detection models for specific tasks, they often fail to incorporate task knowledge into the learning process. In our work, we introduce multimodal datasets and explore multimodal learning and task knowledge integration to enhance failure detection performance. We contribute two visual failure detection datasets: the Bookshelf dataset and the Handover Failure Detection dataset, which consist of failures that occur while a robot places a book on a shelf and performs object handovers with people, respectively. For the Bookshelf dataset, we use video and proprioceptive data to detect anomalous situations by comparing expected and observed motions. For the Handover dataset and a visual-tactile dataset, we find that intermediate fusion of video, force-torque, tactile, and proprioceptive features performs best. For the Handover dataset, video proves to be an essential modality, and learning to predict the human's and robot's actions as auxiliary tasks is also beneficial. Finally, we explore incorporating task knowledge to improve failure classification performance. We show that pre-processing video frames using the known temporal boundaries of the robot's actions and the locations of objects in the scene improves results on a large-scale failure dataset, and that a variable-frame-rate data augmentation method yields further improvements. Our results highlight the importance of multimodal data and task knowledge for failure detection. However, the task-specific nature of existing models makes it impractical to collect data and train a separate model for every task. Using simulators to generate large-scale video data is a viable approach to this problem in future work. The emergence of general-purpose vision-language-action models also presents opportunities for task-agnostic failure detection. Incorporating failure data into their training datasets and, as in our work, making use of multimodal data and task knowledge are likely to further accelerate progress in this area.
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: visuelle Fehlererkennung
dc.subject: Robotik
dc.subject: Deep Learning
dc.subject: Anomalieerkennung
dc.subject: Videoanalyse
dc.subject: visual failure detection
dc.subject: robotics
dc.subject: deep learning
dc.subject: anomaly detection
dc.subject: video analytics
dc.subject.ddc: 004 Informatik
dc.title: Visual Failure Detection in Robotics
dc.type: Dissertation oder Habilitation
dc.identifier.doi: https://doi.org/10.48565/bonndoc-836
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-89239
dc.relation.doi: https://doi.org/10.1109/IROS51168.2021.9636133
dc.relation.doi: https://doi.org/10.1109/ICRA57147.2024.10610143
dc.relation.doi: https://doi.org/10.1109/ICPR56361.2022.9955646
dc.relation.doi: https://doi.org/10.1109/ECMR65884.2025.11162998
ulbbn.pubtype: Erstveröffentlichung
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 8923
ulbbnediss.date.accepted: 26.03.2026
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Plöger, Paul G.
ulbbnediss.contributor.orcid: https://orcid.org/0000-0003-4085-4943

