Fayyaz, Mohsen: Holistic Video Understanding: Spatio-Temporal Modeling and Efficiency. - Bonn, 2024. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-75662
@phdthesis{handle:20.500.11811/11546,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-75662},
author = {Fayyaz, Mohsen},
title = {Holistic Video Understanding: Spatio-Temporal Modeling and Efficiency},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2024,
month = may,

note = {Video understanding, a specialized area within computer vision, focuses on classifying human actions in video clips. The field has seen remarkable advances, particularly with the introduction of 3D convolutional neural networks (3D CNNs). However, these networks often fail to capture spatial and temporal correlations effectively, which limits their performance. To address this, we propose a new network building block that efficiently captures both spatial-channel and temporal-channel correlations across the network layers.
Despite these advances, training 3D CNNs remains challenging because it requires extensive labeled datasets. To sidestep this, we introduce a supervision transfer method that uses a 2D CNN pre-trained on ImageNet to guide the weight initialization of a 3D CNN, substantially reducing the need to train from scratch.
Another challenge in video understanding is the lack of established benchmarks for the joint recognition of multiple semantic aspects of dynamic scenes. To fill this gap, we have developed the "Holistic Video Understanding" (HVU) dataset, a large-scale video benchmark with comprehensive tasks and annotations designed to facilitate video analysis and understanding.
Building on this foundation, we introduce a novel spatio-temporal architecture for video classification. It follows a multi-label, multi-task learning approach and combines 2D and 3D architectures to capture both spatial and temporal information, which lets us address multiple spatio-temporal problems simultaneously and improves the overall effectiveness of video understanding.
We also improve the efficiency of 3D CNNs for both training and inference. To this end, we present a dynamic approach that adapts the temporal resolution of the network to the input frames, enabling the 3D CNN to select the most informative temporal features for action classification and to reduce computation.
We further explore vision transformers for image and video classification. Despite their effectiveness, their high computational cost often makes them impractical for deployment on edge devices. To overcome this, we propose a method that reduces the number of tokens in a vision transformer by automatically selecting an appropriate number of tokens based on the image or video content.
Finally, we turn to temporal action segmentation, the task of recognizing multiple actions within untrimmed videos. Because annotating every frame of a video is costly and complex, we propose an end-to-end trainable approach that learns temporal action segmentation from weakly labeled videos. Building on this weak supervision, we also propose a novel framework for action segmentation based on a two-branch neural network that predicts redundant yet distinct action segmentation representations, improving both the accuracy and efficiency of video understanding.},

url = {https://hdl.handle.net/20.500.11811/11546}
}

License: InCopyright
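
The abstract above summarizes several techniques at a high level; the hedged sketches below illustrate three of them. Each is a minimal illustration under stated assumptions, not the dissertation's actual implementation.

The supervision transfer paragraph describes using an ImageNet-pre-trained 2D CNN to guide the weight initialization of a 3D CNN. One standard way to bootstrap 3D weights from 2D ones is I3D-style kernel inflation (Carreira and Zisserman, 2017); the sketch below shows that baseline technique, which may differ from the dissertation's method, and the function name `inflate_conv2d` is ours.

import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Initialize a 3D convolution from a pre-trained 2D one by repeating
    the 2D kernel along the temporal axis and rescaling by 1/time_dim so
    that activations keep the same magnitude on a static input."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, time_dim, kH, kW)
        conv3d.weight.copy_(
            conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        )
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate the first layer of an ImageNet-pre-trained ResNet.
# conv1 = torchvision.models.resnet18(weights="IMAGENET1K_V1").conv1
# conv3d = inflate_conv2d(conv1, time_dim=3)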
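
The abstract also mentions dynamically adapting a 3D CNN's temporal resolution to the input. A crude illustration of the underlying idea, dropping temporal feature frames that are nearly redundant, is sketched below; the dissertation's approach is a learned mechanism inside the network, whereas this hard similarity threshold (`sim_thresh`) is purely our illustrative assumption.

import torch
import torch.nn.functional as F

def drop_redundant_frames(feats: torch.Tensor, sim_thresh: float = 0.95) -> torch.Tensor:
    """Keep only temporal feature frames that differ enough from the last
    kept frame, measured by cosine similarity.

    feats: (T, C), one feature vector per frame of a single clip.
    Returns (T', C) with T' <= T.
    """
    kept = [0]  # always keep the first frame
    for t in range(1, feats.size(0)):
        sim = F.cosine_similarity(feats[t], feats[kept[-1]], dim=0)
        if sim < sim_thresh:  # frame adds new information -> keep it
            kept.append(t)
    return feats[torch.tensor(kept)]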
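
Finally, the abstract proposes reducing the number of tokens in a vision transformer based on image or video content. One common scoring signal in such methods is the attention the CLS token pays to each patch token; the sketch below keeps the smallest token set that accumulates a given share of that attention mass. The budget rule, the `keep_mass` parameter, and the function name are our assumptions, not the dissertation's algorithm.

import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                 keep_mass: float = 0.9) -> list[torch.Tensor]:
    """Content-adaptive token pruning for one ViT layer.

    tokens:   (B, N, D), tokens[:, 0] is the CLS token
    cls_attn: (B, N-1), head-averaged attention from CLS to each patch token
    Returns per-sample tensors of kept tokens; their number varies with content.
    """
    pruned = []
    for b in range(tokens.size(0)):
        scores = cls_attn[b] / cls_attn[b].sum()        # attention as a distribution
        order = torch.argsort(scores, descending=True)  # most-attended patches first
        cum = torch.cumsum(scores[order], dim=0)
        # smallest prefix of patches whose attention mass reaches keep_mass
        k = int(torch.searchsorted(cum, torch.tensor(keep_mass, device=cum.device)).item()) + 1
        keep = order[:k] + 1                            # +1 skips the CLS index
        cls = torch.zeros(1, dtype=torch.long, device=tokens.device)
        pruned.append(tokens[b, torch.cat([cls, keep])])
    return pruned

Because each sample keeps a different number of tokens, a real implementation would batch the survivors (e.g., by padding or masking); the list return here keeps the sketch short.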