Thoduka, Santosh George: Visual Failure Detection in Robotics. - Bonn, 2026. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-89239
@phdthesis{handle:20.500.11811/14054,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-89239},
doi = {https://doi.org/10.48565/bonndoc-836},
author = {Thoduka, Santosh George},
title = {Visual Failure Detection in Robotics},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2026,
month = mar,
note = {Autonomous robots in human-centric environments can encounter unforeseen situations, making them prone to task execution failures. These failures can lead to unsafe conditions for both humans and robots and result in a loss of trust in robots. Therefore, robots should have the ability to prevent, detect, and respond to failures. In this thesis, we focus on the detection of failures, particularly using video data from the robot's camera. Several visual failure detection datasets have emerged in recent years; however, there is still a scarcity of datasets suitable for building general-purpose failure detection approaches. Existing work has shown the benefit of using multimodal data for failure detection, but determining the best data representations and fusion methods remains an open challenge. Additionally, although most approaches develop failure detection models for specific tasks, they often fail to incorporate task knowledge into the learning process. In our work, we introduce multimodal datasets and explore multimodal learning and task knowledge integration to enhance failure detection performance. We contribute two visual failure detection datasets: the Bookshelf dataset, which captures failures that occur while a robot places a book on a shelf, and the Handover Failure Detection dataset, which captures failures during object handovers with people. For the Bookshelf dataset, we use video and proprioceptive data to detect anomalous situations by comparing expected and observed motions. For the Handover dataset and a visual-tactile dataset, we find that intermediate fusion of video, force-torque, tactile, and proprioceptive features performs best. On the Handover dataset, video proves to be an essential modality, and learning to predict the human's and the robot's actions as auxiliary tasks is also beneficial. Finally, we explore incorporating task knowledge to improve failure classification performance. We show that pre-processing video frames using the known temporal boundaries of the robot's actions and the locations of objects in the scene improves results on a large-scale failure dataset, and a variable frame-rate data augmentation method yields further improvements. Our results highlight the importance of using multimodal data and task knowledge for failure detection. However, the task-specific nature of existing models makes it impractical to collect data and train a separate model for every task. Generating large-scale video data in simulation is a viable direction for future work. The emergence of general-purpose vision-language-action models also presents opportunities for task-agnostic failure detection. Incorporating failure data into their training datasets and, as in our work, making use of multimodal data and task knowledge are likely to further accelerate progress in this area.},
url = {https://hdl.handle.net/20.500.11811/14054}
}
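
The abstract's finding that intermediate fusion of video, force-torque, tactile, and proprioceptive features performs best can be illustrated with a short sketch. The following is a minimal, hypothetical example assuming PyTorch; the encoder layers, feature dimensions, and class count are illustrative assumptions and not the architecture used in the thesis.

import torch
import torch.nn as nn

class IntermediateFusionDetector(nn.Module):
    # One small encoder per modality; fusion concatenates their learned
    # intermediate features before a shared classifier head.
    def __init__(self, video_dim=512, ft_dim=6, tactile_dim=16,
                 proprio_dim=7, fused_dim=256, num_classes=2):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 128), nn.ReLU())
        self.ft_enc = nn.Sequential(nn.Linear(ft_dim, 32), nn.ReLU())
        self.tactile_enc = nn.Sequential(nn.Linear(tactile_dim, 32), nn.ReLU())
        self.proprio_enc = nn.Sequential(nn.Linear(proprio_dim, 32), nn.ReLU())
        self.fusion = nn.Sequential(
            nn.Linear(128 + 32 + 32 + 32, fused_dim), nn.ReLU())
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, video_feat, ft, tactile, proprio):
        # Intermediate fusion: concatenate learned features, rather than
        # raw inputs (early fusion) or per-modality logits (late fusion).
        fused = torch.cat([self.video_enc(video_feat), self.ft_enc(ft),
                           self.tactile_enc(tactile),
                           self.proprio_enc(proprio)], dim=-1)
        return self.classifier(self.fusion(fused))

model = IntermediateFusionDetector()
logits = model(torch.randn(4, 512), torch.randn(4, 6),
               torch.randn(4, 16), torch.randn(4, 7))
print(logits.shape)  # torch.Size([4, 2])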
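
Similarly, the variable frame-rate data augmentation mentioned in the abstract can be sketched as resampling a video's frame indices at a random playback rate. The function name, rate range, and clip length below are illustrative assumptions, not the thesis's exact method.

import numpy as np

def variable_frame_rate_clip(frames, clip_len=16, min_rate=0.5,
                             max_rate=2.0, rng=None):
    # Sample clip_len frame indices at a random playback rate: a rate > 1
    # skips frames (the action appears faster), a rate < 1 repeats frames
    # (the action appears slower).
    if rng is None:
        rng = np.random.default_rng()
    rate = rng.uniform(min_rate, max_rate)
    span = max(1, int(clip_len * rate))
    start = int(rng.integers(0, max(1, len(frames) - span)))
    idx = np.linspace(start, start + span - 1, clip_len).round().astype(int)
    idx = np.clip(idx, 0, len(frames) - 1)
    return [frames[i] for i in idx]

clip = variable_frame_rate_clip(list(range(100)))
print(len(clip))  # 16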

The following terms of use apply to this resource:

InCopyright