Bakhshandegan Moghaddam, Farshad: Distributed Anomaly Detection on Large Knowledge Graphs. - Bonn, 2024. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-79643
@phdthesis{handle:20.500.11811/12546,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-79643},
author = {{Farshad Bakhshandegan Moghaddam}},
title = {Distributed Anomaly Detection on Large Knowledge Graphs},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2024,
month = nov,
note = {Digitization has yielded vast amounts of data, known as Big Data, fostering data analysis. Because this data comes from various sources and is of diverse types, data integration techniques are essential to making analytics more accessible and effective. Knowledge Graphs (KGs) play a vital role in linking diverse data within a directed multi-graph, using Uniform Resource Identifiers. Presently, over 10,000 datasets conform to Semantic Web standards, spanning fields such as the life sciences, industry, and the Internet of Things. KGs are created through various approaches, including crowd-sourcing, natural language processing, and knowledge-extraction tools. However, the input data is often unvalidated and not cross-checked, making KGs vulnerable to errors at both the logical and the semantic level. These errors can manifest in individual triples, affecting the subject, predicate, or object of the RDF (Resource Description Framework) data, or in relationships across triples, compromising the overall quality of KGs. Detecting these errors is not a trivial task because of the complex structure and the sheer size of modern large-scale KG data, which easily surpasses the memory capacity of current computers (e.g., the English DBpedia is $\sim$114 GB). Furthermore, in the majority of cases, there are no defined rules to determine whether entered data is correct or incorrect. The primary objective of this thesis is to identify errors in very large knowledge graphs in a scalable manner, without prior knowledge of ground truth. To achieve this, we employ Anomaly Detection (AD), a branch of data mining, to identify errors in KGs. However, most traditional AD algorithms are not directly applicable to KGs, owing to scalability issues and the complex structure of RDF data. This thesis therefore integrates communication, synchronization, and distribution techniques with AD methods. Like most machine learning techniques, AD requires fixed-length numeric feature vectors; yet KGs have no native fixed-length numeric representation. As a preliminary step, we have proposed a methodology for creating fixed-length numeric feature vectors by extracting features from the graph via map-reduce operations. Accordingly, we have developed methods that enable SPARQL-based (SPARQL Protocol and RDF Query Language) feature extraction. Subsequently, we have introduced a scalable anomaly detection framework that can identify anomalies directly in RDF data. Moreover, to improve the transparency of the framework's output, we provide human-readable explanations that help users understand why detected anomalies should be considered as such. In addition, given the technological complexity involved, we have made our methods easier to apply through complementary work, such as integrating them into coding notebooks and REST (Representational State Transfer) API-based environments. Finally, we have extended the existing SANSA technology stack, through several scientific publications and software releases, to offer these functionalities to the Semantic Web community.},
url = {https://hdl.handle.net/20.500.11811/12546}
}