Self-Supervised Video Representation Learning and Downstream Applications

dc.contributor.advisor: Gall, Juergen
dc.contributor.author: Behrmann, Nadine
dc.date.accessioned: 2023-05-12T12:11:53Z
dc.date.available: 2023-05-12T12:11:53Z
dc.date.issued: 12.05.2023
dc.identifier.uri: https://hdl.handle.net/20.500.11811/10837
dc.description.abstract: Video understanding is an important computer vision task with a great variety of applications, such as autonomous driving, robotics, and video surveillance. Much progress has recently been made on video understanding tasks related to short trimmed videos, such as recognizing human activities. To a large extent, this progress can be attributed to the existence of large-scale labelled datasets. However, labelling large-scale video datasets quickly becomes prohibitively expensive, especially for tasks that involve long-term understanding of untrimmed videos, such as temporal action segmentation. Here, datasets are still relatively small and methods rely on pretrained video representations. More generally, the way data is represented has a significant impact on how easy or difficult a task is to solve, and learning better video representations has the potential to greatly facilitate many video understanding tasks. Although supervised learning is the most prevalent representation learning approach, it is prone to miss relevant features; e.g., currently popular large-scale datasets are inherently biased towards static features. To alleviate this shortcoming, we explore different self-supervised video representation learning methods. This not only allows us to put an explicit focus on temporal features, but also enables the use of the vast amounts of unlabelled video data available online.
First, we investigate how unobserved past frames can be incorporated jointly with future frames to pose a more challenging pretext task that encourages temporally structured representations. To that end, we propose a bidirectional feature prediction task in a contrastive learning framework that requires the model not only to predict past and future video features but also to distinguish between them via temporal hard negatives. Second, we develop a general contrastive learning method that accommodates a set of positives ranked by their desired similarity to the query clip. As the content of videos changes gradually over time, so should the representations; we achieve this by ranking video clips according to their temporal distance. Third, we develop a method for learning stationary and non-stationary features, motivated by the observation that a diverse set of downstream tasks requires different kinds of features: stationary features capture global, video-level attributes such as the action class, while non-stationary features are beneficial for tasks that require more fine-grained temporal information, such as temporal action segmentation.
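To make the bidirectional feature prediction task concrete, the following PyTorch sketch shows an InfoNCE-style loss in which each prediction is scored against both past and future targets of the batch, so the opposite temporal direction of the same video serves as a hard negative. This is a minimal illustration under assumed shapes; the predictor heads `g_past` and `g_future` and the feature tensors are hypothetical placeholders, not the architecture used in the thesis.

```python
# A minimal sketch of bidirectional feature prediction with temporal hard
# negatives; module names (g_past, g_future) and shapes are illustrative
# assumptions, not the thesis implementation.
import torch
import torch.nn.functional as F

def bidirectional_nce(z_obs, z_past, z_future, g_past, g_future, tau=0.07):
    # z_obs:    (B, D) features of the observed clip
    # z_past:   (B, D) target features of the unobserved past frames
    # z_future: (B, D) target features of the unobserved future frames
    p_past = F.normalize(g_past(z_obs), dim=1)      # predicted past features
    p_future = F.normalize(g_future(z_obs), dim=1)  # predicted future features
    targets = torch.cat([F.normalize(z_past, dim=1),
                         F.normalize(z_future, dim=1)], dim=0)  # (2B, D)

    # Each prediction is compared against all past AND future targets in the
    # batch; the temporally "wrong" direction of the same video is a hard
    # negative, so the model must tell past from future, not just match content.
    logits_past = p_past @ targets.t() / tau        # (B, 2B)
    logits_future = p_future @ targets.t() / tau    # (B, 2B)

    B = z_obs.size(0)
    idx = torch.arange(B, device=z_obs.device)
    return (F.cross_entropy(logits_past, idx)            # past target i is row i
            + F.cross_entropy(logits_future, idx + B))   # future target i is row B+i
```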
Finally, we propose a method for action segmentation that is built on top of pretrained video representations. In contrast to previous works, which are based on framewise predictions, we view action segmentation from a sequence-to-sequence perspective – mapping a sequence of video frames to a sequence of action segments – and design a Transformer-based model that directly predicts the segments.
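As a rough illustration of this sequence-to-sequence view, the sketch below feeds pretrained frame features and a set of learned segment queries through a standard PyTorch Transformer to predict (class, boundary) pairs directly. The DETR-style framing, layer sizes, and names such as `num_queries` and `boundary_head` are assumptions for illustration, not the exact model of the thesis.

```python
# A minimal, DETR-style sketch of predicting action segments directly from a
# sequence of pretrained frame features; all hyperparameters are illustrative.
import torch
import torch.nn as nn

class SegmentTransformer(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=256, num_classes=48, num_queries=25):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.transformer = nn.Transformer(d_model=hidden_dim, batch_first=True)
        # One learned query per potential action segment.
        self.queries = nn.Embedding(num_queries, hidden_dim)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no segment"
        self.boundary_head = nn.Linear(hidden_dim, 2)             # (center, width) in [0, 1]

    def forward(self, frame_feats):                # (B, T, feat_dim) pretrained features
        memory = self.proj(frame_feats)            # (B, T, hidden_dim)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.transformer(memory, q)            # (B, num_queries, hidden_dim)
        return self.class_head(h), self.boundary_head(h).sigmoid()

# Usage: map 512 frames of features to a set of candidate segments.
model = SegmentTransformer()
class_logits, boundaries = model(torch.randn(2, 512, 2048))
```

Training such a set-prediction model would additionally require matching predicted segments to ground-truth segments (e.g., via Hungarian matching as in DETR); at inference, queries classified as "no segment" are simply discarded.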
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: video representation learning
dc.subject: self-supervised learning
dc.subject: action segmentation
dc.subject.ddc: 004 Informatik
dc.title: Self-Supervised Video Representation Learning and Downstream Applications
dc.type: Dissertation oder Habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-70416
dc.relation.doi: https://doi.org/10.1007/978-3-031-19833-5_4
ulbbn.pubtype: Erstveröffentlichung
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 7041
ulbbnediss.date.accepted: 10.03.2023
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Stiefelhagen, Rainer
ulbbnediss.contributor.gnd: 1329349989

