Distributed Anomaly Detection on Large Knowledge Graphs

Bakhshandegan Moghaddam, Farshad

dc.contributor.advisor	Lehmann, Jens
dc.contributor.author	Bakhshandegan Moghaddam, Farshad
dc.date.accessioned	2024-11-12T10:24:23Z
dc.date.available	2024-11-12T10:24:23Z
dc.date.issued	12.11.2024
dc.identifier.uri	https://hdl.handle.net/20.500.11811/12546
dc.description.abstract	Digitization has yielded vast data, known as Big Data, fostering data analysis. As this data comes from various sources and is of diverse types, data integration techniques become essential in making analytics more accessible and effective. Knowledge Graphs (KGs) are vital in linking diverse data within a directed multi-graph, utilizing unique resource identifiers. Presently, over 10,000 datasets conform to Semantic Web standards, spanning fields like life sciences, industries, and the Internet of Things. KGs employ various creation approaches, including crowd-sourcing, natural language processing, and knowledge-extraction tools. However, the data used as input is often unvalidated and not cross-checked, making KGs vulnerable to errors at both logical and semantic levels. These errors can manifest within individual triples, impacting the subject, predicate, or object components of the RDF (Resource Description Framework) data, or even occur in relationships across triples, compromising the overall quality of KGs. Detecting these errors is not a trivial task because of the complex structure and the sheer size of modern large-scale KG data, which easily surpasses the available memory capacity of current computers (e.g., the English DBpedia is $\sim 114$ GB in size). Furthermore, in the majority of cases, there are no defined rules to determine whether entered data is correct or incorrect. The primary objective of this thesis is to identify errors in very large knowledge graphs in a scalable manner without prior knowledge of ground truth. To achieve this, we employ Anomaly Detection (AD), a branch of data mining, to identify errors in KGs. However, most traditional AD algorithms are not directly applicable to KGs due to scalability issues and the complex structure of RDF data. This thesis endeavors to integrate communication, synchronization, and distribution techniques with AD methods. 
Like most machine learning techniques, AD requires fixed-length numeric feature vectors. Yet KGs have no native representation as fixed-length numeric feature vectors. As a preliminary step, we have proposed a methodology to create fixed-length numeric feature vectors by extracting features from the graph via map-reduce operations. Accordingly, we have developed methods that enable SPARQL-based (SPARQL Protocol and RDF Query Language) feature extraction. Subsequently, we have introduced a scalable anomaly detection framework that can directly identify anomalies in RDF data. Moreover, to improve the transparency of the framework's output, we have provided human-readable explanations to help users understand why detected anomalies are considered as such. In addition, because of the technological complexity involved, we have made our methods easier to apply through complementary work, such as integrating them into coding notebooks and REST (Representational State Transfer) API-based environments. Finally, we have extended the existing technology stack SANSA through several scientific publications and software releases, to offer these functionalities to the Semantic Web community.	en
dc.language.iso	eng
dc.rights	In Copyright
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject.ddc	004 Computer science
dc.title	Distributed Anomaly Detection on Large Knowledge Graphs
dc.type	Dissertation or Habilitation
dc.publisher.name	Universitäts- und Landesbibliothek Bonn
dc.publisher.location	Bonn
dc.rights.accessRights	openAccess
dc.identifier.urn	https://nbn-resolving.org/urn:nbn:de:hbz:5-79643
dc.relation.doi	https://doi.org/10.3233/SSW210036
dc.relation.doi	https://doi.org/10.1109/ICSC52841.2022.00047
dc.relation.doi	https://doi.org/10.1109/ICSC56153.2023.00040
dc.relation.doi	https://doi.org/10.5281/zenodo.13123433
dc.relation.doi	https://doi.org/10.5281/zenodo.13123388
dc.relation.doi	https://doi.org/10.1109/AIKE59827.2023.00015
ulbbn.pubtype	First publication
ulbbnediss.affiliation.name	Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location	Bonn
ulbbnediss.thesis.level	Dissertation
ulbbnediss.dissID	7964
ulbbnediss.date.accepted	16.10.2024
ulbbnediss.institute	Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet	Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee	Bauckhage, Christian

Files in this item

Name: 7964.pdf
Size: 9.6 MB
Format: PDF

This item appears in the following Collection(s)

E-Dissertationen (4474)
