Abu Farha, Yazan: Recognizing and Anticipating Human Activities in Videos. - Bonn, 2022. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-66039
@phdthesis{handle:20.500.11811/9726,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-66039},
author = {Abu Farha, Yazan},
title = {Recognizing and Anticipating Human Activities in Videos},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2022,
month = apr,

note = {Analyzing human actions in videos has gained increased attention recently. Several approaches, for instance, have been proposed to classify and temporally segment actions in videos. Despite the success of these approaches, they only analyze fully observed videos in which all actions have already been carried out. For many applications, however, this is not sufficient, and it is crucial to reason beyond what has been observed and anticipate the future. While most existing methods for anticipation predict only the very near future, making long-term predictions over more than just a few seconds is a task with many practical applications that has been overlooked in the literature. In this thesis, we therefore propose approaches that predict a large number of future actions and their durations for up to several minutes into the future. The ability to predict the future, however, depends on the level of understanding of what has been observed. As recent approaches for analyzing activities in long untrimmed videos either suffer from over-segmentation errors or cannot be trained end-to-end, we first propose methods to temporally segment activities in long videos that address these limitations. Then, we combine the proposed segmentation and anticipation approaches in an end-to-end framework for long-term anticipation of activities.
To this end, we first propose two multi-stage architectures based on temporal convolutional networks for the temporal action segmentation task. We further introduce a novel smoothing loss that penalizes over-segmentation errors and improves the quality of the predictions. While the proposed segmentation models show strong performance in recognizing and segmenting activities, they depend on the availability of large, fully annotated datasets. Annotating such datasets with the start and end time of each action segment is, however, very costly and time-consuming. Although several approaches have been proposed to reduce the required annotations to an ordered list of the actions occurring in the video, their performance is still much worse than that of fully supervised approaches. We therefore propose a new level of supervision based on timestamps, where only a single frame is annotated for each action segment. We demonstrate that models trained with timestamp supervision achieve performance comparable to fully supervised approaches at a tiny fraction of the annotation cost.
For the long-term anticipation of activities, we propose a two-step approach that first infers the actions in the observed frames using an action segmentation model and then uses the inferred actions to predict the future actions. As predicting further into the future requires accounting for the uncertainty in future actions, we additionally propose a framework that predicts a distribution over the future action segments and uses it to sample multiple possible sequences of future actions.
To further improve the anticipation performance, we combine the proposed segmentation and anticipation models in an end-to-end framework. Furthermore, as predicting the future from the past and the past from the future should be consistent, we introduce a cycle consistency loss over time that predicts the past activities given the predicted future. The end-to-end framework outperforms the two-step approach by a large margin.
For both the action segmentation and anticipation tasks, we evaluate the proposed approaches on several public datasets with long videos containing many action segments. Extensive evaluation shows the effectiveness of the proposed approaches in capturing long-range dependencies and generating accurate predictions.},

url = {https://hdl.handle.net/20.500.11811/9726}
}
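The abstract names two concrete segmentation techniques: multi-stage temporal convolutional networks and a smoothing loss that penalizes over-segmentation. The following is a minimal PyTorch sketch of what such a model and loss could look like; the layer count, channel width, stage count, and truncation threshold tau are illustrative assumptions, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedResidualLayer(nn.Module):
    """One dilated temporal convolution with a residual connection."""
    def __init__(self, dilation, channels):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, 3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, 1)

    def forward(self, x):
        out = F.relu(self.conv_dilated(x))
        return x + self.conv_1x1(out)

class SingleStageTCN(nn.Module):
    """A stack of dilated residual layers with doubling dilation factors,
    giving a large temporal receptive field over the whole video."""
    def __init__(self, in_dim, channels, num_classes, num_layers=10):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, channels, 1)
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(2 ** i, channels) for i in range(num_layers)])
        self.conv_out = nn.Conv1d(channels, num_classes, 1)

    def forward(self, x):
        x = self.conv_in(x)
        for layer in self.layers:
            x = layer(x)
        return self.conv_out(x)  # frame-wise class logits (B, C, T)

class MultiStageTCN(nn.Module):
    """Later stages refine the softmaxed predictions of earlier stages."""
    def __init__(self, in_dim, channels, num_classes, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(
            [SingleStageTCN(in_dim if s == 0 else num_classes,
                            channels, num_classes) for s in range(num_stages)])

    def forward(self, x):
        outputs = []
        for s, stage in enumerate(self.stages):
            x = stage(x if s == 0 else F.softmax(x, dim=1))
            outputs.append(x)
        return outputs  # one logits tensor per stage

def smoothing_loss(logits, tau=4.0):
    """Truncated MSE between log-probabilities of adjacent frames.
    Penalizes spurious frame-to-frame label changes (over-segmentation);
    the truncation keeps genuine action boundaries from dominating."""
    log_probs = F.log_softmax(logits, dim=1)
    delta = log_probs[:, :, 1:] - log_probs[:, :, :-1].detach()
    return torch.clamp(delta ** 2, max=tau ** 2).mean()
```

In training, one would typically sum a frame-wise cross-entropy term and this smoothing term over the outputs of every stage, so that each stage learns to refine the predictions of the previous one.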
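For the anticipation part, the abstract mentions predicting a distribution over future action segments and sampling multiple possible futures from it. The sketch below shows one way such autoregressive sampling could look; the model interface (returning next-action logits and Gaussian duration parameters) and all names are hypothetical stand-ins, not the thesis's actual API.

```python
import torch

def sample_future_sequences(model, observed, horizon, num_samples=25):
    """Draw several possible futures from a model that outputs, per step,
    a categorical distribution over the next action label and the
    parameters of a duration distribution.

    `model(segments)` is a hypothetical interface returning
    (action_logits, duration_mean, duration_std) for the next segment;
    `observed` is a list of (label, duration) pairs for the seen part.
    """
    futures = []
    for _ in range(num_samples):
        segments = list(observed)
        remaining = horizon
        while remaining > 0:
            logits, mu, sigma = model(segments)
            label = torch.distributions.Categorical(logits=logits).sample().item()
            duration = torch.distributions.Normal(mu, sigma).sample().clamp(min=1.0).item()
            duration = min(duration, remaining)  # do not predict past the horizon
            segments.append((label, duration))
            remaining -= duration
        futures.append(segments[len(observed):])  # keep only the predicted part
    return futures
```

Having many sampled futures then allows reporting, for example, the most likely sequence or an uncertainty estimate over the predicted actions.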
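Finally, the cycle consistency idea, predicting the past back from the predicted future, can be illustrated as a loss term. Both sequence models in this sketch are hypothetical stand-ins for the segmentation/anticipation networks in the end-to-end framework.

```python
import torch.nn.functional as F

def cycle_consistency_loss(forward_model, backward_model, past_feats, past_labels):
    """Predict the future from the past, then reconstruct the past from the
    predicted future; the reconstruction should match what was observed.

    Hypothetical interfaces:
    forward_model(past_feats)   -> future_logits (B, C, T_future)
    backward_model(future_probs) -> past_logits  (B, C, T_past)
    """
    future_logits = forward_model(past_feats)
    past_logits = backward_model(F.softmax(future_logits, dim=1))
    # cross-entropy between the reconstructed past and the observed past labels
    return F.cross_entropy(past_logits, past_labels)
```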

License: InCopyright