Behrmann, Nadine: Self-Supervised Video Representation Learning and Downstream Applications. - Bonn, 2023. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-70416
@phdthesis{handle:20.500.11811/10837,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-70416},
author = {Behrmann, Nadine},
title = {Self-Supervised Video Representation Learning and Downstream Applications},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2023,
month = may,

note = {Video understanding is an important computer vision task with a wide variety of applications, such as autonomous driving, robotics and video surveillance. Much progress has recently been made on video understanding tasks involving short trimmed videos, such as recognizing human activities. To a large extent, this progress can be attributed to the existence of large-scale labelled datasets. However, labelling large-scale video datasets quickly becomes prohibitively expensive, especially for tasks that involve long-term understanding of untrimmed videos, such as temporal action segmentation. Here, datasets are still relatively small and methods rely on pretrained video representations. More generally, the way data is represented has a significant impact on how easy or difficult a task is to solve, and learning better video representations has the potential to greatly facilitate many video understanding tasks. Although supervised learning is the most prevalent representation learning approach, it is prone to missing relevant features; for example, the currently popular large-scale datasets are inherently biased towards static features. To alleviate this shortcoming, we explore different self-supervised video representation learning methods. This not only allows us to place an explicit focus on temporal features, but also enables the use of the vast amounts of unlabelled video data available online.
First, we investigate how unobserved past frames can be incorporated jointly with future frames to pose a more challenging pretext task that encourages temporally structured representations. To that end, we propose a bidirectional feature prediction task in a contrastive learning framework that requires the model not only to predict past and future video features but also to distinguish between them via temporal hard negatives. Second, we develop a general contrastive learning method that accommodates a set of positives ranked by their desired similarity to the query clip. As the content of a video changes gradually over time, so should its representation; we achieve this by ranking video clips by their temporal distance to the query. Third, we develop a method for learning stationary and non-stationary features, motivated by the observation that diverse downstream tasks require different kinds of features: stationary features capture global, video-level attributes such as the action class, while non-stationary features are beneficial for tasks that require more fine-grained temporal information, such as temporal action segmentation.
Finally, we propose a method for action segmentation that is built on top of pretrained video representations. In contrast to previous works, which are based on framewise predictions, we view action segmentation from a sequence-to-sequence perspective, mapping a sequence of video frames to a sequence of action segments, and design a Transformer-based model that directly predicts the segments.},

url = {https://hdl.handle.net/20.500.11811/10837}
}
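
The second paragraph of the abstract describes a contrastive pretext task in which past and future features of the same video serve as temporal hard negatives for each other. The sketch below illustrates such an objective; it is a minimal PyTorch illustration under assumed names (bidirectional_nce, temperature, batch layout) and is not the implementation used in the thesis.

    # Minimal sketch: InfoNCE-style loss with temporal hard negatives.
    # All names and hyperparameters here are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def bidirectional_nce(pred_future, feat_future, feat_past, temperature=0.1):
        """For each query, the observed future feature of the same video is the
        positive, its past feature acts as a temporal hard negative, and all
        other clips in the batch serve as regular negatives."""
        q = F.normalize(pred_future, dim=-1)    # (B, D) predicted future features
        pos = F.normalize(feat_future, dim=-1)  # (B, D) observed future features
        neg = F.normalize(feat_past, dim=-1)    # (B, D) observed past features

        logits_pos = q @ pos.t() / temperature                # (B, B)
        logits_neg = q @ neg.t() / temperature                # (B, B) temporal hard negatives
        logits = torch.cat([logits_pos, logits_neg], dim=1)   # (B, 2B)

        # The matching future of the same video sits on the diagonal of logits_pos.
        targets = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, targets)

    # Usage with random tensors standing in for encoder outputs.
    B, D = 8, 128
    loss = bidirectional_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))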
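
The final paragraph describes a sequence-to-sequence view of action segmentation, in which a Transformer maps a sequence of frame features directly to a sequence of action segments. The following is a minimal sketch of that idea; the learned segment queries, layer sizes, and output heads (action class plus relative duration per segment slot) are assumptions for illustration and do not reproduce the thesis model.

    # Minimal sketch: Transformer that decodes a fixed number of segment slots
    # from per-frame features. Sizes and heads are illustrative assumptions.
    import torch
    import torch.nn as nn

    class Seq2SeqSegmenter(nn.Module):
        def __init__(self, feat_dim=2048, d_model=256, num_classes=19, max_segments=25):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=4, num_encoder_layers=3,
                num_decoder_layers=3, batch_first=True)
            # One learned query per output segment slot.
            self.segment_queries = nn.Parameter(torch.randn(max_segments, d_model))
            self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for an empty slot
            self.duration_head = nn.Linear(d_model, 1)

        def forward(self, frame_feats):                      # (B, T, feat_dim)
            src = self.proj(frame_feats)                     # (B, T, d_model)
            tgt = self.segment_queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
            decoded = self.transformer(src, tgt)             # (B, max_segments, d_model)
            # Class logits per slot and relative durations that sum to 1 over slots.
            return self.class_head(decoded), self.duration_head(decoded).softmax(dim=1)

    # Usage: pretrained per-frame features of two 300-frame videos.
    model = Seq2SeqSegmenter()
    classes, durations = model(torch.randn(2, 300, 2048))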

The following terms of use are associated with this resource:

InCopyright