Mining Frequent Itemsets from Transactional Data Streams with Probabilistic Error Bounds

Trabold, Daniel

Volltext

Dokument öffnen (9.7MB)

Autor

Trabold, Daniel

Art der Hochschulschrift

Dissertation

Prüfungsdatum

18.03.2020

Datum der Veröffentlichung

27.05.2020

Erstgutachter

Wrobel, Stefan

Zweitgutachter

Bauckhage, Christian

Beteiligte Institutionen

Rheinische Friedrich-Wilhelms-Universität Bonn

Metadaten

Zur Langanzeige

Zitierbare Links

Handle: https://hdl.handle.net/20.500.11811/8383
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5-58198

Inhalt

Frequent itemset mining is a classical data mining task with a broad range of applications, including fraud discovery and product recommendation. The enumeration of frequent itemsets has two main benefits for such applications: First, frequent itemsets provide a human-understandable representation of knowledge. This is crucial as human experts are involved in designing systems for these applications. Second, many efficient algorithms are known for mining frequent itemsets. This is essential as many of today’s realworld applications produce ever-growing data streams. Examples of these are online shopping, electronic payment or phone call transactions. With limited physical main memory, the analysis of data streams can, in general, be only approximate. State-ofthe-art algorithms for frequent itemset mining from such streams bound their error by processing the transactions in blocks of fixed size, either each transaction individually or in mini-batches. In theory, single transaction-based updates provide the most up-todate result after each transaction, but this enumeration is inefficient in practice as the number of frequent itemsets for a single transaction can be exponential in its cardinality. Mini-batch-based algorithms are faster but can only produce a new result at the end of each batch. In this thesis, the binary choice between up-to-date results and speed is eliminated. To provide more flexibility, we develop new algorithms with a probabilistic error bound that can process an arbitrary number of transactions in each batch.
State-of-the-art algorithms mining frequent itemsets from data streams with minibatches derive the size of the mini-batch from a user-defined error parameter and hence couple their error bound to the size of the update. By introducing a dynamic error bound that adapts to the length of the data stream the error is decoupled from the size of the update. The benefits of this approach are twofold: First, the dynamic error bound is independent of the size of the update. Hence, an arbitrary number of transactions can be processed without losing the error bound. Second, the bound becomes tighter as more transactions arrive and thus the tolerated error decreases, in contrast to algorithms with static thresholds. Our approach is extensively compared to the state-of-the-art in an empirical evaluation. The results confirm that the dynamic approach is not only more flexible but also outperforms the state-of-the-art in terms of F-score for a large number of data streams.
As it is easier for experts to extract knowledge from a smaller collection, we consider mining a compact pattern set. Especially useful are parameterized pattern classes for which the expert can regulate the size of the output. An example of such a parameterized pattern class are strongly closed itemsets. Additionally, they are stable against small changes in the data stream. We present an algorithm mining strongly closed itemsets from data streams. It builds on reservoir sampling and is thus capable of producing a result after any number of transactions, once the initial sample is complete. The high approximation quality of the algorithm is empirically demonstrated and the potential of strongly closed patterns for two stream mining tasks is shown: concept drift detection and product configuration recommendation.

Schlagwörter

Häufige Itemsets, Datenströme, Dynamische Fehlerschranken, Probabilistische Garantien, frequent itemsets, data streams, strongly closed itemsets, dynamic error bounds, propabilistic guarantees

Klassifikation (DDC)

004 Informatik

Zitiervorschlag
BibTeX

Trabold, Daniel: Mining Frequent Itemsets from Transactional Data Streams with Probabilistic Error Bounds. - Bonn, 2020. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-58198

@phdthesis{handle:20.500.11811/8383,
urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-58198,
author = {{Daniel Trabold}},
title = {Mining Frequent Itemsets from Transactional Data Streams with Probabilistic Error Bounds},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2020,
month = may,
note = {Frequent itemset mining is a classical data mining task with a broad range of applications, including fraud discovery and product recommendation. The enumeration of frequent itemsets has two main benefits for such applications: First, frequent itemsets provide a human-understandable representation of knowledge. This is crucial as human experts are involved in designing systems for these applications. Second, many efficient algorithms are known for mining frequent itemsets. This is essential as many of today’s realworld applications produce ever-growing data streams. Examples of these are online shopping, electronic payment or phone call transactions. With limited physical main memory, the analysis of data streams can, in general, be only approximate. State-ofthe-art algorithms for frequent itemset mining from such streams bound their error by processing the transactions in blocks of fixed size, either each transaction individually or in mini-batches. In theory, single transaction-based updates provide the most up-todate result after each transaction, but this enumeration is inefficient in practice as the number of frequent itemsets for a single transaction can be exponential in its cardinality. Mini-batch-based algorithms are faster but can only produce a new result at the end of each batch. In this thesis, the binary choice between up-to-date results and speed is eliminated. To provide more flexibility, we develop new algorithms with a probabilistic error bound that can process an arbitrary number of transactions in each batch.
State-of-the-art algorithms mining frequent itemsets from data streams with minibatches derive the size of the mini-batch from a user-defined error parameter and hence couple their error bound to the size of the update. By introducing a dynamic error bound that adapts to the length of the data stream the error is decoupled from the size of the update. The benefits of this approach are twofold: First, the dynamic error bound is independent of the size of the update. Hence, an arbitrary number of transactions can be processed without losing the error bound. Second, the bound becomes tighter as more transactions arrive and thus the tolerated error decreases, in contrast to algorithms with static thresholds. Our approach is extensively compared to the state-of-the-art in an empirical evaluation. The results confirm that the dynamic approach is not only more flexible but also outperforms the state-of-the-art in terms of F-score for a large number of data streams.
As it is easier for experts to extract knowledge from a smaller collection, we consider mining a compact pattern set. Especially useful are parameterized pattern classes for which the expert can regulate the size of the output. An example of such a parameterized pattern class are strongly closed itemsets. Additionally, they are stable against small changes in the data stream. We present an algorithm mining strongly closed itemsets from data streams. It builds on reservoir sampling and is thus capable of producing a result after any number of transactions, once the initial sample is complete. The high approximation quality of the algorithm is empirically demonstrated and the potential of strongly closed patterns for two stream mining tasks is shown: concept drift detection and product configuration recommendation.},
url = {https://hdl.handle.net/20.500.11811/8383}
}

Die folgenden Nutzungsbestimmungen sind mit dieser Ressource verbunden: