
Distributed Anomaly Detection on Large Knowledge Graphs

dc.contributor.advisor: Lehmann, Jens
dc.contributor.author: Bakhshandegan Moghaddam, Farshad
dc.date.accessioned: 2024-11-12T10:24:23Z
dc.date.available: 2024-11-12T10:24:23Z
dc.date.issued: 2024-11-12
dc.identifier.uri: https://hdl.handle.net/20.500.11811/12546
dc.description.abstract [en]: Digitization has produced vast amounts of data, known as Big Data, and has fostered data analysis. Because this data comes from various sources and is of diverse types, data integration techniques are essential for making analytics more accessible and effective. Knowledge Graphs (KGs) play a vital role in linking diverse data within a directed multi-graph by means of unique resource identifiers. At present, over 10,000 datasets conform to Semantic Web standards, spanning fields such as the life sciences, industry, and the Internet of Things. KGs are created using various approaches, including crowd-sourcing, natural language processing, and knowledge-extraction tools. However, the input data is often unvalidated and not cross-checked, which makes KGs vulnerable to errors at both the logical and the semantic level. These errors can appear in individual triples, affecting the subject, predicate, or object of the RDF (Resource Description Framework) data, or in relationships across triples, compromising the overall quality of KGs. Detecting these errors is not trivial because of the complex structure and sheer size of modern large-scale KG data, which easily exceeds the available memory of current computers (e.g., the English DBpedia is $\sim 114$ GB). Furthermore, in the majority of cases there are no defined rules for determining whether entered data is correct or incorrect. The primary objective of this thesis is to identify errors in very large knowledge graphs in a scalable manner, without prior knowledge of ground truth. To achieve this, we employ Anomaly Detection (AD), a branch of data mining. However, most traditional AD algorithms are not directly applicable to KGs because of scalability issues and the complex structure of RDF data. This thesis therefore seeks to integrate communication, synchronization, and distribution techniques with AD methods.
Like most machine learning techniques, AD requires fixed-length numeric feature vectors, yet KGs have no native representation of this kind. As a preliminary step, we propose a methodology that creates fixed-length numeric feature vectors by extracting features from the graph via map-reduce operations. Accordingly, we have developed methods that enable SPARQL-based (SPARQL Protocol and RDF Query Language) feature extraction. We then introduce a scalable anomaly detection framework that can identify anomalies directly in RDF data. To improve the transparency of the framework's output, we provide human-readable explanations that help users understand why the detected anomalies should be considered as such. In addition, because of the technological complexity involved, we have made our methods easier to apply through complementary work, such as integrating them into coding notebooks and REST (Representational State Transfer) API-based environments. Finally, we have extended the existing technology stack SANSA through several scientific publications and software releases in order to offer these functionalities to the Semantic Web community.
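The feature-extraction step described in the abstract can be illustrated with a small, single-machine sketch. It is not the thesis's implementation and does not use SANSA or SPARQL: the triples, the `ex:` names, the two chosen features, and the helper functions `extract_features` and `iqr_outliers` are all hypothetical, and a simple interquartile-range (IQR) rule stands in for whichever AD methods the thesis actually applies. The sketch only shows how triples might be reduced to fixed-length numeric vectors per subject and how an unusual value could then be flagged.

```python
from collections import defaultdict
from statistics import quantiles

# Toy (subject, predicate, object) triples; all names and values are invented.
TRIPLES = [
    ("ex:Alice", "ex:age", 34.0),
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:age", 29.0),
    ("ex:Carol", "ex:age", 41.0),
    ("ex:Dave", "ex:age", 38.0),
    ("ex:Eve", "ex:age", 3100.0),  # plausibly a data-entry error
    ("ex:Eve", "ex:knows", "ex:Alice"),
    ("ex:Eve", "ex:knows", "ex:Bob"),
]


def extract_features(triples):
    """Reduce triples by subject into fixed-length vectors [out_degree, age].

    Each triple contributes to its subject's vector (the "map" side), and
    contributions are accumulated per subject (the "reduce" side). A subject
    without an ex:age triple would keep 0.0 in the second slot; the toy data
    avoids that case.
    """
    vectors = defaultdict(lambda: [0.0, 0.0])
    for subject, predicate, obj in triples:
        vectors[subject][0] += 1.0  # count outgoing triples
        if predicate == "ex:age":
            vectors[subject][1] = float(obj)
    return dict(vectors)


def iqr_outliers(vectors, index, k=1.5):
    """Return subjects whose feature at `index` lies outside the IQR fences."""
    values = [vec[index] for vec in vectors.values()]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return sorted(s for s, vec in vectors.items() if not low <= vec[index] <= high)


if __name__ == "__main__":
    features = extract_features(TRIPLES)
    for subject in iqr_outliers(features, index=1):
        print(f"{subject}: ex:age = {features[subject][1]} lies outside the IQR fences")
```

In a distributed setting the per-subject accumulation would be expressed as a keyed reduce over partitions of the graph rather than a loop over an in-memory list; the sketch keeps everything local so that it stays self-contained.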
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject.ddc: 004 Computer science
dc.title: Distributed Anomaly Detection on Large Knowledge Graphs
dc.type: Dissertation or Habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-79643
dc.relation.doi: https://doi.org/10.3233/SSW210036
dc.relation.doi: https://doi.org/10.1109/ICSC52841.2022.00047
dc.relation.doi: https://doi.org/10.1109/ICSC56153.2023.00040
dc.relation.doi: https://doi.org/10.5281/zenodo.13123433
dc.relation.doi: https://doi.org/10.5281/zenodo.13123388
dc.relation.doi: https://doi.org/10.1109/AIKE59827.2023.00015
ulbbn.pubtype: First publication
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 7964
ulbbnediss.date.accepted: 2024-10-16
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Bauckhage, Christian


The following terms of use are associated with this resource:

In Copyright