Modeling Human Actions in Multi-Label settings

Biswas, Sovan

Author

Biswas, Sovan

ORCID

https://orcid.org/0000-0002-9866-8433

Type of Scholarly Publication

Dissertation

Date of Exam

09.07.2025

Date of Publication

15.09.2025

Advisor

Gall, Jürgen

Co-Referee

Kühne, Hilde

Degree Granting Institutions

Rheinische Friedrich-Wilhelms-Universität Bonn

Citable Links

Handle: https://hdl.handle.net/20.500.11811/13448
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5-84141
DOI: https://doi.org/10.48565/bonndoc-653

Abstract

Human actions are often complex and occur in dynamic contexts, posing a challenge for traditional recognition models. This challenge is further exacerbated by humans' innate tendency to multitask, i.e., a person typically performs multiple actions at the same time. This thesis delves into multi-label human action recognition and analysis, bridging the gap between single and group activities. Furthermore, the thesis acknowledges the cost of the labeled data required for training and discusses novel approaches to developing models in settings that range from fully supervised to data-scarce, weakly supervised environments.
The core contributions lie in developing novel neural network architectures that can capture the intricacies of multi-label action recognition. Our first contribution concerns Structural Recurrent Neural Networks (SRNNs) for group activity analysis. These networks capture individual actions, interactions between individuals, and the overall group activity, regardless of the size of the group. Moving beyond group activity, we also proposed a Hierarchical Graph-RNN that specifically tackles multiple individual actions. This architecture incorporates the temporal context and relationships between different actions to achieve accurate multi-label recognition in space and time.
Beyond fully supervised settings, we also explored weakly supervised learning, where action annotations are scarce. Here, our approaches rely on sets of actions, rather than individual classes, as annotations, which are cost- and time-effective to obtain. Our initial approach uses Multi-Instance Multi-Label (MIML) learning followed by constraint-based linear programming to map the set of actions to individual humans in a video. Furthermore, the thesis addresses the challenge of longer videos in weakly supervised settings. Here, a novel Multiple Instance Triplet Loss (MITL) exploits temporal similarity across consecutive frames, in contrast to temporally distant frames, to train the action recognition model effectively.
Through this dissertation, we advanced the state of the art in multi-label action analysis, proposed novel architectures for group and individual action recognition that exploit temporal and spatial context, and explored approaches to developing models for weakly supervised settings. We demonstrated the effectiveness of our approaches through comprehensive experimentation and by comparing them with existing state-of-the-art methods on well-known public benchmarks. We conclude by discussing the open challenges and possible future research directions for multi-label human action analysis.

Subjects

Multi-label action recognition, Spatio-temporal action detection, Weak-supervision

Classification (DDC)

004 Computer science

Related Publications

https://doi.org/10.48550/arXiv.1802.02091
https://doi.org/10.48550/arXiv.2101.08581
https://doi.org/10.48550/arXiv.2101.08567
https://doi.org/10.1109/ICCVW54120.2021.00245

Suggested Citation
BibTeX

Biswas, Sovan: Modeling Human Actions in Multi-Label settings. - Bonn, 2025. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-84141

@phdthesis{handle:20.500.11811/13448,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-84141},
doi = {https://doi.org/10.48565/bonndoc-653},
author = {Biswas, Sovan},
title = {Modeling Human Actions in Multi-Label settings},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2025,
month = sep,
url = {https://hdl.handle.net/20.500.11811/13448}
}
