Self-Supervised Video Representation Learning and Downstream Applications

dc.contributor.advisor: Gall, Juergen
dc.contributor.author: Behrmann, Nadine
dc.date.accessioned: 2023-05-12T12:11:53Z
dc.date.available: 2023-05-12T12:11:53Z
dc.date.issued: 12.05.2023
dc.identifier.uri: https://hdl.handle.net/20.500.11811/10837
dc.description.abstract: Video understanding is an important computer vision task with a great variety of applications, such as autonomous driving, robotics, and video surveillance. Much progress has recently been made on video understanding tasks related to short trimmed videos, such as recognizing human activities. To a large extent, this progress can be attributed to the existence of large-scale labelled datasets. However, labelling large-scale video datasets quickly becomes prohibitively expensive, especially for tasks that involve long-term understanding of untrimmed videos, such as temporal action segmentation. Here, datasets are still relatively small and methods rely on pretrained video representations. More generally, the way data is represented has a significant impact on how easy or difficult a task is to solve, and learning better video representations has the potential to greatly facilitate many video understanding tasks. Although supervised learning is the most prevalent representation learning approach, it is prone to miss relevant features; e.g., currently popular large-scale datasets are inherently biased towards static features. To alleviate this shortcoming, we explore different self-supervised video representation learning methods. This not only allows us to put an explicit focus on temporal features, but also enables the use of the vast amounts of unlabelled video data available online.
First, we investigate how unobserved past frames can be incorporated jointly with future frames to pose a more challenging pretext task that encourages temporally structured representations. To that end, we propose a bidirectional feature prediction task in a contrastive learning framework that requires the model not only to predict past and future video features but also to distinguish between them via temporal hard negatives. Second, we develop a general contrastive learning method that accommodates a set of positives ranked by their desired similarity to the query clip. As the content of videos changes gradually over time, so should the representations; we achieve this by ranking video clips according to their temporal distance. Third, we develop a method for learning stationary and non-stationary features, motivated by the observation that a diverse set of downstream tasks requires different kinds of features: stationary features capture global, video-level attributes such as the action class, while non-stationary features are beneficial for tasks that require more fine-grained temporal information, such as temporal action segmentation.
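To make the bidirectional feature prediction task concrete, the following PyTorch sketch shows an InfoNCE-style loss in which each prediction is scored against both past and future targets of the batch, so the opposite temporal direction of the same video serves as a hard negative. This is a minimal illustration under assumed shapes; the predictor heads `g_past` and `g_future` and the feature tensors are hypothetical placeholders, not the architecture used in the thesis.

```python
# A minimal sketch of bidirectional feature prediction with temporal hard
# negatives; module names (g_past, g_future) and shapes are illustrative
# assumptions, not the thesis implementation.
import torch
import torch.nn.functional as F

def bidirectional_nce(z_obs, z_past, z_future, g_past, g_future, tau=0.07):
    # z_obs:    (B, D) features of the observed clip
    # z_past:   (B, D) target features of the unobserved past frames
    # z_future: (B, D) target features of the unobserved future frames
    p_past = F.normalize(g_past(z_obs), dim=1)      # predicted past features
    p_future = F.normalize(g_future(z_obs), dim=1)  # predicted future features
    targets = torch.cat([F.normalize(z_past, dim=1),
                         F.normalize(z_future, dim=1)], dim=0)  # (2B, D)

    # Each prediction is compared against all past AND future targets in the
    # batch; the temporally "wrong" direction of the same video is a hard
    # negative, so the model must tell past from future, not just match content.
    logits_past = p_past @ targets.t() / tau        # (B, 2B)
    logits_future = p_future @ targets.t() / tau    # (B, 2B)

    B = z_obs.size(0)
    idx = torch.arange(B, device=z_obs.device)
    return (F.cross_entropy(logits_past, idx)            # past target i is row i
            + F.cross_entropy(logits_future, idx + B))   # future target i is row B+i
```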
Finally, we propose a method for action segmentation that is built on top of pretrained video representations. In contrast to previous works, which are based on framewise predictions, we view action segmentation from a sequence-to-sequence perspective – mapping a sequence of video frames to a sequence of action segments – and design a Transformer-based model that directly predicts the segments.
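As a rough illustration of this sequence-to-sequence view, the sketch below feeds pretrained frame features and a set of learned segment queries through a standard PyTorch Transformer to predict (class, boundary) pairs directly. The DETR-style framing, layer sizes, and names such as `num_queries` and `boundary_head` are assumptions for illustration, not the exact model of the thesis.

```python
# A minimal, DETR-style sketch of predicting action segments directly from a
# sequence of pretrained frame features; all hyperparameters are illustrative.
import torch
import torch.nn as nn

class SegmentTransformer(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=256, num_classes=48, num_queries=25):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.transformer = nn.Transformer(d_model=hidden_dim, batch_first=True)
        # One learned query per potential action segment.
        self.queries = nn.Embedding(num_queries, hidden_dim)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no segment"
        self.boundary_head = nn.Linear(hidden_dim, 2)             # (center, width) in [0, 1]

    def forward(self, frame_feats):                # (B, T, feat_dim) pretrained features
        memory = self.proj(frame_feats)            # (B, T, hidden_dim)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.transformer(memory, q)            # (B, num_queries, hidden_dim)
        return self.class_head(h), self.boundary_head(h).sigmoid()

# Usage: map 512 frames of features to a set of candidate segments.
model = SegmentTransformer()
class_logits, boundaries = model(torch.randn(2, 512, 2048))
```

Training such a set-prediction model would additionally require matching predicted segments to ground-truth segments (e.g., via Hungarian matching as in DETR); at inference, queries classified as "no segment" are simply discarded.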
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: video representation learning
dc.subject: self-supervised learning
dc.subject: action segmentation
dc.subject.ddc: 004 Informatik
dc.title: Self-Supervised Video Representation Learning and Downstream Applications
dc.type: Dissertation oder Habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-70416
dc.relation.doi: https://doi.org/10.1007/978-3-031-19833-5_4
ulbbn.pubtype: Erstveröffentlichung
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 7041
ulbbnediss.date.accepted: 10.03.2023
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Stiefelhagen, Rainer
ulbbnediss.contributor.gnd: 1329349989

