Souri, Yaser: Temporal Action Segmentation: Weak Supervision and Efficiency. - Bonn, 2023. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-71804
@phdthesis{handle:20.500.11811/11007,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-71804},
author = {{Yaser Souri}},
title = {Temporal Action Segmentation: Weak Supervision and Efficiency},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2023,
month = aug,

note = {Action segmentation deals with the temporal understanding of actions that occur in an untrimmed video. The output of action segmentation is an action label for each frame of the input video. This task has recently received considerable attention from both academia and industry, as it has many important applications such as home monitoring systems, guidance systems in assembly lines, and semantic video retrieval. While existing methods for action segmentation already achieve encouraging results, most require fully annotated datasets for training. Obtaining fully annotated untrimmed videos for training action segmentation models is expensive, which has driven much recent progress on weakly supervised approaches for action segmentation from transcripts and timestamps. However, most existing weakly supervised action segmentation approaches are very inefficient during training and testing, while others make unrealistic assumptions about the type of weak supervision available during training. In this dissertation, we focus on weakly supervised approaches for action segmentation that are efficient during both training and testing. We also focus on a more realistic form of weak supervision, timestamps, without restrictive assumptions.
First, we propose a novel end-to-end approach for weakly supervised action segmentation from transcripts based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation, and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of these two redundant representations (a sketch of the idea follows the entry below). Using the MuCon loss together with a loss for transcript prediction, our proposed approach matches the accuracy of state-of-the-art approaches while being 14 times faster to train and 20 times faster during inference. This approach provides a unified architecture that supports both fully and weakly supervised training and makes mixed-supervision training possible.
Second, we introduce FIFA, a fast approximate inference method for action segmentation and alignment. Unlike previous approaches, FIFA does not rely on expensive dynamic programming for inference. Instead, it uses an approximate differentiable energy function that can be minimized using gradient descent (see the sketch after the entry below). FIFA is a general approach that can replace exact inference, improving its speed by more than 5 times while maintaining its performance. FIFA is an anytime inference algorithm that provides a better speed vs. accuracy trade-off than exact inference.
Third, we focus on weak supervision in the form of timestamps. Timestamp supervision is a promising type of weak supervision: it is much less expensive to obtain than full supervision and still yields accurate action segmentation models. Previous works assume that every action segment is annotated with a timestamp, which is an unrealistic and restrictive assumption. Robustness against missing annotations is of crucial importance, as such omissions can occur during the annotation process. We propose an approach for action segmentation from timestamps that is robust to missing timestamp annotations for some action segments.
In robotics applications, after recognizing the actions, the goal is to perform an action and complete a task. To this end, our final contribution is task-driven object detection. When humans have to solve everyday tasks, they simply pick the most suitable objects. While trivial for humans, this problem is not addressed by current benchmarks and approaches for object detection, which focus on detecting object categories. We therefore introduce the COCO-Tasks dataset, which comprises about 40,000 images in which the most suitable objects for 14 tasks have been annotated. We furthermore propose an approach that detects the most suitable objects for a given task. The approach builds on a Gated Graph Neural Network to exploit the appearance of each object as well as the global context of all objects present in the scene (a sketch follows the entry below).
We evaluate our approaches on various benchmarks and datasets. Extensive evaluations and comparisons with the state of the art and various baselines demonstrate both the effectiveness and the efficiency of our proposed approaches.},

url = {https://hdl.handle.net/20.500.11811/11007}
}
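The mutual consistency idea from the first contribution can be illustrated with a short sketch. The code below is a hypothetical simplification, not the dissertation's exact MuCon formulation: the function names, the assumption that one branch outputs frame-wise logits while the other outputs per-segment logits with relative lengths, and the hard frame-to-segment assignment are all choices made for this sketch (the real method would need a soft, differentiable expansion to train the length head end to end).

```python
import torch
import torch.nn.functional as F

def segments_to_framewise(seg_logits, seg_lengths, num_frames):
    """Expand segment-level class logits to a frame-wise sequence.

    seg_logits:  (S, C) class logits, one row per predicted segment.
    seg_lengths: (S,)   positive relative segment lengths.
    Returns (num_frames, C) frame-wise logits via a hard assignment of
    frames to segments (a simplification; see the lead-in above).
    """
    rel = seg_lengths / seg_lengths.sum()
    counts = (rel * num_frames).round().long().clamp(min=1)
    counts[-1] = num_frames - counts[:-1].sum()  # force counts to sum to T
    return torch.repeat_interleave(seg_logits, counts, dim=0)

def mutual_consistency_loss(frame_logits, seg_logits, seg_lengths):
    """Symmetric KL divergence between the two frame-wise views."""
    num_frames = frame_logits.shape[0]
    expanded = segments_to_framewise(seg_logits, seg_lengths, num_frames)
    p = F.log_softmax(frame_logits, dim=-1)
    q = F.log_softmax(expanded, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))
```

A symmetric divergence is used here so that neither branch is treated as the fixed target; each redundant representation pulls the other toward agreement.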
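The gradient-based inference of the second contribution can likewise be sketched. The following is a minimal, hypothetical illustration in the spirit of the abstract's description, not FIFA's actual energy function: soft_masks, fifa_style_inference, and the sigmoid-based soft segment masks are assumptions made for this sketch. Given frame-wise class log-probabilities and an ordered transcript of actions, segment lengths are optimized by gradient descent instead of exact dynamic programming.

```python
import torch

def soft_masks(lengths, num_frames, sharpness=10.0):
    """Differentiable (T, S) soft assignment of frames to transcript steps."""
    rel = torch.softmax(lengths, dim=0)              # positive, sums to 1
    ends = torch.cumsum(rel, dim=0)                  # relative segment ends
    starts = ends - rel                              # relative segment starts
    t = torch.linspace(0.0, 1.0, num_frames).unsqueeze(1)   # (T, 1)
    inside = (torch.sigmoid(sharpness * num_frames * (t - starts))
              * torch.sigmoid(sharpness * num_frames * (ends - t)))
    return inside / (inside.sum(dim=1, keepdim=True) + 1e-8)

def fifa_style_inference(frame_logprobs, transcript, steps=50, lr=0.1):
    """Minimize a differentiable alignment energy with gradient descent.

    frame_logprobs: (T, C) frame-wise class log-probabilities.
    transcript:     ordered list of action indices, one per segment.
    Returns a (T,) tensor of frame-wise action labels.
    """
    num_frames = frame_logprobs.shape[0]
    lengths = torch.zeros(len(transcript), requires_grad=True)
    optimizer = torch.optim.Adam([lengths], lr=lr)
    obs = frame_logprobs[:, transcript]              # (T, S) scores per step
    for _ in range(steps):
        optimizer.zero_grad()
        masks = soft_masks(lengths, num_frames)      # (T, S)
        energy = -(masks * obs).sum() / num_frames   # expected neg. log-prob
        energy.backward()
        optimizer.step()
    with torch.no_grad():
        seg = soft_masks(lengths, num_frames).argmax(dim=1)  # frame -> step
        return torch.tensor([transcript[s] for s in seg])
```

Because every optimization step yields a usable labeling, stopping early simply trades accuracy for speed, which is what makes such an approach an anytime algorithm.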
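Finally, the task-driven detection model can be sketched at a similar level of abstraction. The code below is hypothetical, not the dissertation's actual architecture: it assumes per-object appearance features from a detector and a fully connected graph over the objects in a scene, so that every node sees the global context, with a GRU cell providing the gated update characteristic of a Gated Graph Neural Network. TaskObjectRanker and all of its internals are names invented for this sketch.

```python
import torch
import torch.nn as nn

class TaskObjectRanker(nn.Module):
    """Score detected objects by task suitability via gated message passing."""

    def __init__(self, feat_dim=256, steps=3):
        super().__init__()
        self.msg = nn.Linear(feat_dim, feat_dim)   # message function
        self.gru = nn.GRUCell(feat_dim, feat_dim)  # gated node update
        self.score = nn.Linear(feat_dim, 1)        # task-suitability score
        self.steps = steps

    def forward(self, obj_feats):
        """obj_feats: (N, D) appearance features of the detected objects."""
        h = obj_feats
        n = h.shape[0]
        for _ in range(self.steps):
            # Aggregate messages from all other nodes (fully connected graph),
            # so each object's update depends on the global scene context.
            m = self.msg(h)
            agg = (m.sum(dim=0, keepdim=True) - m) / max(n - 1, 1)
            h = self.gru(agg, h)                   # gated update per node
        return self.score(h).squeeze(-1)           # higher = more suitable
```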

The following license files are associated with this item:

InCopyright