Subtrajectory Clustering, Curve Averaging and the Complexity of Underlying Range Spaces

Brüning, Frederik Sebastian

Volltext

View/Open (7MB)

Author

Brüning, Frederik Sebastian

Type of Scholarly Publication

Dissertation

Date of Exam

19.11.2024

Date of Publication

13.12.2024

Advisor

Driemel, Anne

Co-Referee

Röglin, Heiko

Degree Granting Institutions

Rheinische Friedrich-Wilhelms-Universität Bonn

Metadata

Show full item record

Citable Links

Handle: https://hdl.handle.net/20.500.11811/12636
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5-80005
DOI: https://doi.org/10.48565/bonndoc-442

Abstract

In this thesis, we study the clustering of spatial data with a focus on trajectory data. Trajectory data appears in various applications. These range from the recorded positions of moving objects (e.g. animals, humans, vehicles) to the change of measurements over time (e.g. biomarkers, electricity demand, temperature, sea level). A trajectory is usually modeled as a polygonal curve that is derived from the data by linear interpolation between consecutive observations. A clustering area that we are particularly interested in is subtrajectory clustering which consists of finding reoccurring patterns in trajectory data. We model subtrajectory clustering as a set cover problem and measure similarity based on the Fréchet distance. Given a polygonal curve with n vertices, the goal is to find the smallest set of center curves of complexity l such that each point on the input curve is part of a subcurve that has Fréchet distance of at most a given Delta to at least one of the center curves. We design bicriterial-approximation algorithms for this NP-hard problem. If there exists a solution of size k, then our algorithms find solutions of size O(k l log(k) log(l)) that solve the problem under distance O(Delta). The expected running time and space requirement of our algorithms is polynomial in k, l, n, 1/Delta and the arclength of the input curve. Our approach uses a variation of the multiplicative weight update method on a simplified version of the problem.
The second clustering problem that we study is curve averaging: the problem of optimizing the center curve for a fixed set of curves. In particular, we study a widely used heuristic for curve averaging under the dynamic time warping (DTW) distance called the DTW Barycenter Averaging (DBA) algorithm. The algorithm is very similar to the popular k-means algorithm. Given an initial center curve, it alternates between assignment and update steps until convergence. We study the number of iterations that DBA performs until it converges to a local optimal solution. We assume that DBA is given n polygonal curves of m points in R^d and a parameter k that specifies the length of the average curve to be computed. We conduct experiments that support the general view that DBA converges fast in practice. In contrast, we show that in the worst-case the number of iterations can be exponential in k. This gap between practical performance and worst-case analysis suggests that the worst-case behaviour is likely degenerate. To analyze the number of iterations on non-degenerate input, we further study DBA in the model of smoothed analysis. This model is based on bounding the expected number of iterations in the worst-case under random perturbation of the input. We achieve a bound that is polynomial in k, n and d, and for constant n/d is also polynomial in m.
Additionally, we study the complexity of range spaces underlying clustering problems where ranges are balls that are implicitly given by a center and a radius Delta and include all elements that are at a distance of at most Delta to the center. As distance measures, we consider Hausdorff distance, Fréchet distance and DTW. As centers and elements of the ground set, we consider polygonal curves in R^d and polygonal regions in R^2. To measure the complexity, we bound the VC-dimension and the shattering dimension of the resulting range spaces. Our approach is based on splitting range queries for the considered range spaces into simple predicates that can be determined by sign values of polynomials. This enables us to bound the VC-dimension and shattering dimension based on the number of cells in the arrangement of zero sets of these polynomials.

Subjects

Clustering, Subtrajectory, Fréchet distance, dynamic time warping, Hausdorff distance, VC dimension, range space, curve averaging, DBA, Smoothed Analysis, Multiplicative weight update, epsilon-nets

Classification (DDC)

004 Informatik

Related Publications

arXiv:2308.05998
arXiv:2401.05841
https://doi.org/10.57717/cgt.v2i1.7
https://doi.org/10.4230/LIPIcs.ESA.2022.28
https://doi.org/10.1137/1.9781611978032.20

Zitiervorschlag
BibTeX

Brüning, Frederik Sebastian: Subtrajectory Clustering, Curve Averaging and the Complexity of Underlying Range Spaces. - Bonn, 2024. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-80005

@phdthesis{handle:20.500.11811/12636,
urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-80005,
doi: https://doi.org/10.48565/bonndoc-442,
author = {{Frederik Sebastian Brüning}},
title = {Subtrajectory Clustering, Curve Averaging and the Complexity of Underlying Range Spaces},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2024,
month = dec,
note = {In this thesis, we study the clustering of spatial data with a focus on trajectory data. Trajectory data appears in various applications. These range from the recorded positions of moving objects (e.g. animals, humans, vehicles) to the change of measurements over time (e.g. biomarkers, electricity demand, temperature, sea level). A trajectory is usually modeled as a polygonal curve that is derived from the data by linear interpolation between consecutive observations. A clustering area that we are particularly interested in is subtrajectory clustering which consists of finding reoccurring patterns in trajectory data. We model subtrajectory clustering as a set cover problem and measure similarity based on the Fréchet distance. Given a polygonal curve with n vertices, the goal is to find the smallest set of center curves of complexity l such that each point on the input curve is part of a subcurve that has Fréchet distance of at most a given Delta to at least one of the center curves. We design bicriterial-approximation algorithms for this NP-hard problem. If there exists a solution of size k, then our algorithms find solutions of size O(k l log(k) log(l)) that solve the problem under distance O(Delta). The expected running time and space requirement of our algorithms is polynomial in k, l, n, 1/Delta and the arclength of the input curve. Our approach uses a variation of the multiplicative weight update method on a simplified version of the problem.
The second clustering problem that we study is curve averaging: the problem of optimizing the center curve for a fixed set of curves. In particular, we study a widely used heuristic for curve averaging under the dynamic time warping (DTW) distance called the DTW Barycenter Averaging (DBA) algorithm. The algorithm is very similar to the popular k-means algorithm. Given an initial center curve, it alternates between assignment and update steps until convergence. We study the number of iterations that DBA performs until it converges to a local optimal solution. We assume that DBA is given n polygonal curves of m points in R^d and a parameter k that specifies the length of the average curve to be computed. We conduct experiments that support the general view that DBA converges fast in practice. In contrast, we show that in the worst-case the number of iterations can be exponential in k. This gap between practical performance and worst-case analysis suggests that the worst-case behaviour is likely degenerate. To analyze the number of iterations on non-degenerate input, we further study DBA in the model of smoothed analysis. This model is based on bounding the expected number of iterations in the worst-case under random perturbation of the input. We achieve a bound that is polynomial in k, n and d, and for constant n/d is also polynomial in m.
Additionally, we study the complexity of range spaces underlying clustering problems where ranges are balls that are implicitly given by a center and a radius Delta and include all elements that are at a distance of at most Delta to the center. As distance measures, we consider Hausdorff distance, Fréchet distance and DTW. As centers and elements of the ground set, we consider polygonal curves in R^d and polygonal regions in R^2. To measure the complexity, we bound the VC-dimension and the shattering dimension of the resulting range spaces. Our approach is based on splitting range queries for the considered range spaces into simple predicates that can be determined by sign values of polynomials. This enables us to bound the VC-dimension and shattering dimension based on the number of cells in the arrangement of zero sets of these polynomials.},
url = {https://hdl.handle.net/20.500.11811/12636}
}

The following license files are associated with this item: