Wagner, Jörg: Robust and Interpretable Visual Perception Using Deep Neural Networks. - Bonn, 2023. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-69413
@phdthesis{handle:20.500.11811/10573,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-69413},
author = {{Jörg Wagner}},
title = {Robust and Interpretable Visual Perception Using Deep Neural Networks},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2023,
month = jan,

note = {Autonomous vehicles promise to revolutionize the transportation of people and goods by increasing road safety, reducing resource consumption, and improving quality of life. To achieve an unrestricted, large-scale deployment in the real world without any human supervision, many challenges still need to be solved. A key challenge is the robust perception and interpretation of the surroundings. Deep learning-based approaches have significantly advanced the creation of robust environment representations in recent years. However, further improvements are required, for example to cope with difficult environmental conditions (adverse weather, low lighting, ...). In the first part of this thesis, we investigate approaches to improve the robustness of vision-based perception models. One promising approach is to fuse the data of multiple complementary sensors. Building on previous deep learning-based pedestrian detectors operating on visible images, we develop a multispectral detector. Our detector combines the data of a visible and a thermal camera using a deep fusion network and provides significantly better results than comparable single-sensor models. To the best of our knowledge, this is the first work to use a deep learning-based approach for multispectral pedestrian detection.
A complementary method for improving perception performance is the temporal filtering of information. The filtering task can be divided into a prediction and an update step. We first explore the prediction step and propose an approach for generating semantic forecasting models by transforming trained, non-predictive feed-forward networks. The predictive transformation is based on a structural extension of the network with a recurrent predictive module and a teacher-student training strategy. The resulting semantic forecasting architecture models the dynamics of the scene, enabling meaningful predictions. Building on the knowledge gained, we design a parameter-efficient approach that temporally filters the representations of the Fully Convolutional DenseNet (FC-DenseNet) in a hierarchical manner. Using a simulated dataset with significant disturbances (e.g. noise, occlusions, missing information), we show the advantages of temporal filtering and compare our architecture with similar temporal networks.
A disadvantage of many modern perception models is their black-box character. Especially in safety-critical applications, a high degree of transparency and interpretability is beneficial, as it facilitates model development, debugging, failure analysis, and validation. In the second part of this thesis, we study two approaches to increase the transparency and interpretability of deep learning-based models. First, we consider the concept of interpretability by design and propose a modular, interpretable representation filter that divides the filtering task into multiple, less complex subtasks. Additional insights into its functioning are gained by introducing intermediate representations that are interpretable to humans. These representations also enable the integration of domain knowledge. Using our proposed filter, we increase the robustness of a semantic segmentation model. As an alternative to designing more interpretable architectures, post-hoc explanation methods can be used to gain insights into the decision-making process of a model. We present such a method, which creates visual explanations in the image space by successively removing either relevant or irrelevant pixels from the input image. A core component of our approach is a novel technique to defend against adversarial evidence (i.e. faulty evidence due to artifacts). In a multitude of experiments, we show that our method creates fine-grained and class-discriminative explanations which are faithful to the model.},

url = {https://hdl.handle.net/20.500.11811/10573}
}
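
The abstract describes a multispectral pedestrian detector that combines visible and thermal camera data with a deep fusion network. The thesis architecture is not reproduced here; the following is only a minimal illustrative sketch of the general idea of mid-level sensor fusion in PyTorch. The class name FusionDetector, all layer sizes, and the two-class output are assumptions made for the example.

```python
# Illustrative sketch only -- not the architecture from the thesis.
# Two convolutional branches (visible / thermal) whose feature maps are
# concatenated and processed by a shared head; all sizes are assumptions.
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # One small feature extractor per modality.
        def branch(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.rgb_branch = branch(3)      # visible camera: 3 channels
        self.thermal_branch = branch(1)  # thermal camera: 1 channel
        # Fusion head operates on the concatenated feature maps.
        self.head = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),  # e.g. pedestrian vs. background
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_branch(rgb), self.thermal_branch(thermal)], dim=1)
        return self.head(fused)

# Usage with dummy inputs (batch of 4, 128x128 images).
model = FusionDetector()
logits = model(torch.randn(4, 3, 128, 128), torch.randn(4, 1, 128, 128))
print(logits.shape)  # torch.Size([4, 2])
```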
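The abstract also mentions a post-hoc explanation method that successively removes relevant or irrelevant pixels from the input image. The sketch below shows only the generic deletion-style idea, not the method from the thesis; in particular, its defense against adversarial evidence is not reproduced, relevance is approximated here by the input-gradient magnitude, and the function name and all parameters are assumptions.

```python
# Illustrative sketch only -- a generic deletion-style explanation, not the
# method from the thesis (its adversarial-evidence defense is not reproduced).
import torch

def deletion_explanation(model, image, target_class, steps=10,
                         frac_per_step=0.02, remove_relevant=True):
    """Successively remove pixels and record how the target-class score reacts.

    Relevance is approximated by the input-gradient magnitude; pixels are
    'removed' by setting them to zero. Returns the mask of removed pixels
    and the score trajectory.
    """
    image = image.clone().requires_grad_(True)          # image: (C, H, W)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()
    relevance = image.grad.abs().sum(dim=0)              # per-pixel relevance (H, W)

    working = image.detach().clone()
    removed = torch.zeros_like(relevance, dtype=torch.bool)
    scores = []
    k = max(1, int(frac_per_step * relevance.numel()))   # pixels removed per step

    for _ in range(steps):
        flat = relevance.clone()
        # Exclude already-removed pixels from the ranking.
        flat[removed] = float('-inf') if remove_relevant else float('inf')
        order = flat.flatten().argsort(descending=remove_relevant)
        removed.view(-1)[order[:k]] = True
        masked = working * (~removed).float().unsqueeze(0)
        with torch.no_grad():
            scores.append(model(masked.unsqueeze(0))[0, target_class].item())
    return removed, scores
```

A faithful explanation should show the class score dropping quickly when relevant pixels are removed and staying nearly constant when irrelevant pixels are removed; that intuition motivates this kind of deletion test.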

License: InCopyright